Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems

Realization and Evaluation of Measurement Feedback Coherent Ising Machines for
Combinatorial Optimization Problems ૊߹ͤ࠷దԽ໰୊ͷͨΊͷ ଌఆϑΟʔυόοΫܕίώʔϨϯτɾΠδϯάϚγϯ ͷ࣮ݱͱධՁ Yoshitaka Haribara Advisor: Prof. Kazuyuki Aihara January 23, 2018 Department of Mathematical Informatics University of Tokyo

Thesis Structure 1.Introduction 2.Maximum Cut Problem 3.Degenerate OPO 4.Coherent Ising
Machine 5.Numerical simulation 6.Experiment 1: Results and Validation 7.Experiment 2: Performance Evaluation 8.Conclusion 2 [1,2,5] Ref. [3,4,6]

1. Introduction

Processor Trends 4 • Uniprocessor performance declined around 2003 (from
52% to 22% / year) • GPU (cf. NVIDIA Tesla V100: 5120 core) • deep learning [1] / scientiﬁc simulation (molecular dynamics [2], visualizing CT scan [3], astrophysics many body system [4]) • Field programmable gate array (FPGA; programmable hardware) • Image processing (medical[5], broadcast[6]) / Datacenter (search[7]) [Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp] 1980 1990 2000 2010 1 10 100 1000 104 105 Year #Cores Single-thread performance (SPECint x 103) Freq.(MHz) Power (W) [1] A. Krizhevsky, et al., NIPS 1097 (2012).; Y. LeCun, et al., Nature 521, 436 (2015). [2] A. Götz, et al., Journal of chemical theory and computation 8, 1542 (2012). [3] O. Kutter, et al., Computer Methods and Programs in Biomedicine 94, 250 (2009). [4] L. Nyland, et al., GPU gems 3, 677 (2007). [5] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-medical.pdf [6] https://www.altera.com/en_US/pdfs/literature/wp/wp-brdcst0306.pdf [7] A. Putnam, et al. Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE (2014).

Google+UCSB [http://web.physics.ucsb.edu/~martinisgroup/ photos/Linear9XmonSurfaceCode.jpeg] Traffic flow in Beijing (from Ref. [6])
TPU (Google) [https://cloud.google.com/blog/big-data/ 2017/05/an-in-depth-look-at-googles- first-tensor-processing-unit-tpu] Problem Specific Processors • Application specific IC (ASIC) • TPU (1st gen. for prediction) in neural network [1,2] • 8 bit quantization to reduce the number of multiplier • Matrix (systolic array) and nonlinear activation unit • (Universal) Quantum Computer (n ≧ 49 [3]) • Quantum simulation for Fermionic structures in chemistry and material science [4] • Quantum Annealer (Ising problem) • Machine learning to find Higgs Boson [5] • Traffic flow optimization [6] 5 [1] https://cloud.google.com/blog/big-data/2017/05/an-in- depth-look-at-googles-first-tensor-processing-unit-tpu [2] D. Silver, et al., Nature 550, 354 (2017). [3] J. Kerry, et al., Nature 519, 66 (2015) for n = 9. [4] J.R. McClean, et al., arXiv:1710.07629 [quant-ph] (2017). [5] A Mott, et al., Nature 550, 375 (2017). [6] F. Neukart, et al., arXiv:1708.01625 [quant-ph] (2017). D-Wave [https://www.dwavesys.com/ resources/media-resources]

Complexity Classes NP P = NP = NP-complete P NP-complete
NP-hard NP-hard P ≠ NP P = NP Computational Complexity • Ising problem (NP-hard) • minimize: • Quantum Annealers [1] • D-Wave (n = 2000, chimera graph) [2] / NEDO • Ising Processor • Hitachi (n = 20,000 CMOS and FPGA) [3] / Fujitsu (FPGA) [4] Ising machines 6 1k-spin sub-array 780 × 380 μm2 4 mm 3 mm 1k spin circuits SRAM I/F Fig. 5—Ising Chip Photograph. The chip has 20,000 spin circuits in a 3-mm × 4-mm = 12-mm2 area. I/F: interface Hitachi [http://www.hitachi.com/rev/ pdf/2016/r2016_06_110.pdf] H = X i6=j Jij i j, i 2 { 1, 1} (i = 1, . . . , n) HQA = X i6=j Jij z i z j (t) n 1 X i=0 x i Ising Transverse ﬁeld [1] T. Kadowaki and H. Nishimori, Phys. Rev. E 58, 5355 (1998). [2] https://www.dwavesys.com/sites/default/ﬁles/D- Wave%202000Q%20Tech%20Collateral_0117F2.pdf [3] M. Yamaoka, et al., IEEE Journal of Solid-State Circuits 51, 303; C. Yoshimura, et al., IEEE CANDAR 436. [4] S. Matsubara, et al., In Conference on Complex, Intelligent, and Software Intensive Systems, 432. K4,4 in chimera [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]

N = 4, 16 CIM [A. Marandi, Z. Wang, K.
Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientiﬁc Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 7 Optical Delay Lines

My Contribution • Proposed a scalable architecture of coherent Ising
machine with DOPO and FPGA, which enabled to implement arbitrary topology of graphs. • Mathematical modeling with CSDE and numerical simulations with reasonable parameters which led to the experimental success • Benchmark on MAX-CUT instances up to n ≦ 2,000 (experiment) and n ≦ 20,000 (simulation) against existing algorithms

2. Maximum Cut Problems

CUT = 4 Maximum Cut (MAX-CUT) • Given a graph
G(V, E) with weight wij. • Find the largest cut in G. • Cut is a partition of V into 2 subsets. • #Cut = total weight of crossing edges. • The term (1 - σi σj)/2 only counts the crossing edge, where σ = {+1, -1}. • Equivalent to anti-ferromagnetic Ising problem (minimizing Σ wij σ i σ j ). +1 +1 -1 -1 10 [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]

Algorithms for MAX-CUT -0.87856: Semi-deﬁnite Programming (SDP) relaxation [Goemans and
Williamson, 1995] -0.53112: Variant of Spectral partitioning (minimum eigenvalue) [Trevisan, 2008] -0.5 + ε (ε > 0): Random Walk [Kale, et al., 2010] -0.5: Greedy [Sahni and Gonzalez, 1976] Approx. ratio -Breakout Local Search [Benlic, et al., 2013] -Simulated Annealing [Kirkpatrick, et al., 1983] -SG3 (Variant of Greedy) [Kahruman, et al., 2007] UGC [2] 0.94117 PCP [1] NP-hard P Planar [Orlova and Dorfman, 1972; Hadlock, 1975], Weakly bipartite [Grötschel, et al., 1981], Positive weighted without long odd cycle [Grötschel, et al., 1984], Integer weight (≦ N) with ﬁxed genus [Galluccio, et al., 2001] Graphs solvable in P Heuristics 11 [1] S. Arora and S. Safra., JACM 45, 70 (1998). [2] S. Khot, STOC 767 (2002).

SDP relaxation • A continuous-valued relaxation problem of MAX-CUT: •
The matrix becomes Positive Semi-Deﬁnite. • The optimal value of this reduction problem (class P) is called SDP Upper Bound. • A solution for the original MAX-CUT is obtained by random projections. Expectation values > 0.87856 of OPT (MAX-CUT). vi vj [M. X. Goemans and D. P. Williamson, J. ACM 42, 1115 (1995)] 12

Simulated Annealing (SA) • Meta-heuristic : gradually decreasing the “temperature”,
i.e., noise, then converge to the ground state in inﬁnite time [Geman and Geman, IEEE PAMI 6, 721 (1984)] • Performance is heavily dependent on : temperature scheduling (log), data structure (vector or array), multi-spin coding (below) 0 1 0 0 0 1 … 1 1 0 1 0 1 0 1 0 1 0 … 1 0 0 1 1 ↓ bitwise XOR 1 1 1 0 1 1 … 0 1 0 0 1 64-bit multi-spin coding spin σj weight wij (wij=±1 complete graph) → bit count (SIMD instruction) where wij ^ σj (Courtesy of S. Tamate & T. Sonobe @ NII) 13 [Kirkpatrick, et al., Science 220, 671 (1982)] Pﬂip = ⇢ 1 ( E  0) exp( E) ( E > 0)

BLS: Breakout Local Search • Combination of local search &
perturbation • 3 types of perturbation: uni-directional, bi-directional, random • Constant time spin update with bucket data structure 1. Local search (steepest descent) 2. Perturbation (forced spin ﬂipping) P 1-P Q 1 - Q → ⁶ random Perturbation 14

Backet Data V1 V2 V3 V4 V5 V6 Cut 1
-1 1 -1 1 1 -1 1 -1 ɾ ɾ ɾ ɾ ɾ ɾ 1 2 3 4 5 6 gmax : 3 2 1 0 -1 -2 -3 : gmin gmax : 3 2 1 0 -1 -2 -3 : gmin S1 S2 S1 S2 • Gain list for each side of the partition and vertex address list enable constant O(1) time spin conﬁguration update. 6 4 5 2 1 3 maxgain maxgain 15

Greedy Algorithm and its Variant • Input : A graph
with edge weights wij. • output : A cut {S1, S2} and the cut value cut(S1, S2). 1. ∀i ∈ V \{S1 ∪ S2 } , score(i ) = max{ w(i, S1 ), w(i, S2 )} SG3: score(i ) = |w(i, S1 ) - w(i, S2 )| 2. cut += score(imax) x y S1 S2 ∀i ∈ V’ x y S1 S2 ∀i ∈ V’ imax “best in” [S. Kahruman, et al., Int. J. Comp. Sci. and Eng. 3, 211 (2005)]

Neural Network based Optimization Algorithms • Hopfield Network (HN) [J.
J. Hopfield, PNAS 79, 2554 (1982)] • Update rule : majority vote: • Derandomized SA • Hopfield-Tank NN (HTNN) [Hopfield and Tank, Science 233, 625 (1986)] • Parallel implementation with a many-core processor sigmoid sgn 0 @ X j Jijxj bj 1 A

QUBO-MILP • MAX-CUT → UBQP/QUBO → IBM CPLEX (MILP). Mixed-
integer programming with a linear objective function. • zij = xi xj when xi ,xj ∈ {0, 1}, and deﬁne the convex hull of {(xi , xj , xixj ) | xi ,xj ∈ {0, 1} }. (Rooted Correlation Semimetric Polytope) • LP-based branch&cut minimize : X i<j Qijzij + n X i=1 Qiixi subject to : zij  xi, zij  xj, xi + xj zij  1, (8i < j) zij 0, x 2 {0, 1}n

3. Degenerate OPO

Optical Parametric Oscillator (OPO) • In a degenerate optical parametric
oscillator (DOPO), a nonlinear crystal is placed in a cavity (mirrors). • Frequency down conversion via molecular vibration occurs. The input with frequency ωp (pump) is converted to ωs (signal). • Degeneracy: • OPO can be used as a physical random number generator with quantum noise [A. Marandi, et al., Optics express 20, 19322 (2012)] [C. Nabors, et al., JOSA B 7, 815 (1990)] frequency phase Self-phase-locking 20 ( ) Pump Signal Nonlinear crystal Pump laser

Formulation of a DOPO • System Hamiltonian H ↓ Liouville
von-Neumann Equation • Quantum Master Equation for density matrix ρ ↓ Phase Space Expansion • Fokker-Planck Equation for Wigner quasi-distribution W(α) ↓ Ito Formula • Langevin Equation: phase space variable α 21 i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵

Pump Field System Hamiltonian Signal Field Parametric Coupling External Pumping
Irreversible Interaction with the Reservoir where H = Hfree + Hinteraction + Hpump + Hsys-res Hfree = ~!sˆ a† s ˆ as + ~!pˆ a† p ˆ ap Hinteraction = i~  2 (ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s ) Hpump = i~("ˆ a† p "⇤ˆ ap) Hsys-res = ~(ˆ as ˆ† Rs + ˆ Rsˆ a† s + ˆ ap ˆ† Rp + ˆ Rpˆ a† p ) ( ) Pump Signal 22 : photon creation operator : photon annihilation operator ˆ a ˆ a† [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]

• Using the Liouville von-Neumann Equation • Master equation for
the density matrix ρ • Born and Markov approximation, Adiabatic elimination • No analytic solution in the κ ≠ 0 case Master Equation i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] 23 @ˆ ⇢ @t = i!s[ˆ a† s ˆ as, ˆ ⇢] + i  2 [ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s , ˆ ⇢] + s 2 (2ˆ as ˆ ⇢ˆ a† s ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ a† s ˆ as) + s¯ ns(ˆ as ˆ ⇢ˆ a† s + ˆ a† s ˆ ⇢ˆ as ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ asˆ a† s ). [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]

Fokker-Planck Equation • Expanded with Wigner quasi-distribution W(α) • Fokker-Planck
Eq. for a single DOPO in truncated Wigner representation • Note: • Truncation (dropping the cubic derivatives) is valid in the small quantum noise limit • Wigner representation can describe the nonclassical ﬁeld @W @t = ⇣ 2 + i! ⌘ @ @↵ ↵ + ⇣ 2 i! ⌘ @ @↵⇤ ↵⇤ + (¯ n + 1 2 ) @2 @↵@↵⇤ W 24 [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵

• Complex amplitude (c + is) ∈ is described as
the SDE: Normalized pump rate p (pth = 1), Steady state amplitude As Langevin Equation Squeezed vacuum Phase Sensitive Ampliﬁcation Squeezed states Pitchfork bifurcation Independent Gaussian noise (μ=0, σ=1) loss saturated gain [P. Kinsler and P. D. Drummond, Phys. Rev. A. 43, 6194 (1991)] 25

Ising problem Coherent Ising Machine (CIM) Objective function Whole loss
of DOPO network Spin variable DOPO phase Adjacency matrix Coupling constant DOPO network • Feedback (injection) term: • Effective potential : detected amplitude : coupling constant [Z. Wang, A. Marandi, K. Wen, R. L. Byer, and Y. Yamamoto, Phys. Rev. A 88, 063853 (2013)] -1.0 -0.5 0.5 1.0 -0.004 -0.002 0.002 26

4. Coherent Ising Machine [Y. Haribara, S. Utsunomiya, and Y.
Yamamoto, Entropy 18, 151 (2016)]

N = 4, 16 CIM [A. Marandi, Z. Wang, K.
Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientiﬁc Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 28 Optical Delay Lines

Measurement Feedback CIM (proposed) • To stabilize the O(N) delay
lines is experimentally hard for large N • Multiple DOPO pulses in a ﬁber is coupled by a FPGA circuit • Stochastic differential equation for each DOPO signal amplitude: SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity discretize in time 29 gain I/O (without noise) Time evolution g(c) = c p 1 + c2 c(t) = s 1 1 + exp( t) [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] dci = [( 1 + p c2 i s2 i )ci + N X j=1 ⇠ij ˜ cj] dt + 1 As r c2 i + s2 i + 1 2 dWci

5. Numerical Simulations [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y.
Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)]

Working Equations ˜ ci = cout i p Tmes =
ci r 1 Tmes Tmes fi As Measured Value SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity 0 20 40 60 80 100 1.0 0.5 0.0 0.5 1.0 Normalized time Number of round trips DOPO signal amplitude ci 33 34 35 36 0.80 0.85 0.90 0.95 1.00 1.05 1.10 A C B A B C [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y. Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)] A B C dci = [(p c2 i s2 i )ci] dt + 1 As r c2 i + s2 i + 1 2 dWci dsi = [( p c2 i s2 i )si] dt + 1 As r c2 i + s2 i + 1 2 dWsi ci(t + t) 7! p 1 Tmesci(t) + p Tmes fi As si(t + t) 7! p 1 Tmessi(t) + p Tmes fi As ci(t + t) 7! p 1 Tinjci(t) + p Tinj⇠ n X j=1 Jij ˜ cj si(t + t) 7! p 1 Tinjsi(t)

Parameter sweep (g14) Mean Best 0.2 0.4 0.6 0.8 1.0
-1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant 2850 2870 2890 2910 0.2 0.4 0.6 0.8 1.0 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant 2890 2910 2930 2950 100 trials for each parameter

Parameter sweep (K100) Mean Best 100 trials for each parameter

Simulation results g22, g39 in G-set • g22, |V| =
2000, |E| = 19990 (density 1%), random graph • Gradually pumped from p = 0.9 to p = 1.3 linearly. • Coupling constant ξ = -0.02/√<k> = 0.004 • g39, |V| = 2000, |E| = 11778 (density 0.59%), scale-free with ±1 weight • Gradual pump (linear: from p = 0.9 to 1.3) • Coupling constant ξ = -0.05/√<k> = 0.0042 CUT Ratio SDP UB 14136 — BestKnown 13359 0.945 SA 13351 0.944 Best 13081 0.925 Mean 12997.99 0.919 GW-SDP 12993 0.919 CUT Ratio SDP UB 2877 — BestKnown 2408 0.946 SA 2379 0.943 Best 2273 0.931 Mean 2228.69 0.926 GW-SDP 2200 0.923

Results for G-set (sparse) • For the random graphs in
G-set, the number of edges m are independent from n, which results in the constant scaling for SA as O(|E|) • GW (PD relative gap 10-3), best in 100 trials of SA and CIM with hysteretic optimization (estimated time) on Intel Xeon X5650 (2.67 GHz), 94 GB RAM. • MILP solved by CPLEX-MILP (mixed integer linear programming) [Ikuta, et al., 2015], BLS [Benlic, et al., 2013] CIM SA GW 800 2000 5000 7000 10000 0.001 0.1 10 1000 105 Problem size N Computation time s 35 Topology GW-SDP n roundings MILP CIM w/ hysteretic SA BLS Comp. time 2 s - 105 s 12 s - 105 s 0.05 s 0.05 s 1 s - 104 s Random 0.9273 0.9116 0.9448 0.9471 0.9491 Scale-free 0.9288 0.9269 0.9451 0.9510 0.9547 Toroidal 0.9375 0.9618 0.9511 0.9548 0.9616 All 0.9303 0.9349 0.9464 0.9502 0.9540 (Random graph)

Computation time scaling on complete graphs • GW : interior
point method for solving SDP → O(N 3.5) • SA : Constant Monte Carlo steps × avg. degree → O(N 2) • CIM : Optical time-scale of DOPO turn-on delay time () CIM (simulation) SA GW SG3 BLS 40 160 800 4000 20000 10-4 0.01 1 100 104 106 Problem size N Computation time (s) () 200 m fiber 2 km fiber 20 km fiber 2 GHz clock 20 GHz clock 200 GHz clock 100 1000 104 105 106 107 10-5 10-4 0.001 0.010 0.100 1 Problem size N Computation time (s) Graph order N Graph order N Cavity length Pulse repetition Note for SA: N = 105 㱺 3 sec for 20 MCS (multi-core CPU)

6. Experimental Results and Validation [P. L. McMahon*, A. Marandi*,
Y.haribara, et al., Science 354, 614 (2016)]

Machine conﬁgurations 38 Stanford NTT N Number of pulses 100
(+60 for stabilization) 2048 (+2834 for stabilization) Round trip time (Fiber length) 1.6 μs (320 m) 5 μs (1 km) Pulse repetition freq. (Pulse interval) 100 MHz (= 10 ns) 1 GHz (= 1 ns) FPGA Xilinx Virtex-6 (SX475T) 2x Xilinx Virtex-7 (VX690T) Connection arbitrary topology with Jij = {+1, 0, -1} Current operational scheme Gradual coupling Gradual pumping [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)] SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity Chap. 6 Chap. 7

Single Trial N = 16 Cubic graph (Möbius ladder) N
= 100 Random graph (Erdös-Rényi) and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injection) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is realized. Details are provided in the supplementary materials (26). imental schematic of a measurement-feedback–based coherent Ising machine. n–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear dically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction e is measured and used to compute a feedback signal that effectively couples the ependent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local G, second-harmonic generation; FPGA, field-programmable gate array. A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A er of Problem Instances 400 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy Graph A 100% 38% Graph C 58% Graph B Exact Solution Exact Solution Exact Solution on November 6, 2 http://science.sciencemag.org/ Downloaded from 0 50 100 150 200 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 Round trip number in-phase amplitude 0 50 100 150 200 0 5 10 15 20 Round trip number Cut ments Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising machines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injection) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is realized. Details are provided in the supplementary materials (26). ack–based coherent Ising machine. etric oscillator is formed by a nonlinear cavity containing 160 pulses. A fraction ck signal that effectively couples the ulator; PM, phase modulator; LO, local ammable gate array. Roundtrip Number 50 100 150 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 mputation Time (µs) Computation Time (µs) 50 100 150 200 ph A Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy on November 6, 2016 http://science.sciencemag.org/ Downloaded from between qubits remains a major challenge (15), with important implications for the efficiency of AQC/QA systems (16). Networks of coupled optical parametric oscil- lators (OPOs) are an alternative physical system, with an unconventional operating mech- anism (17–20), for solving the Ising problem (21, 22) and by extension many other combinatorial optimization problems (23). For- mally, the N-spin Ising problem is to find the configuration of spins si ∈ f−1; þ1g (i = 1, ..., N) that minimizes the energy function H ¼ − X 1≤i < j ≤N Jij si sj − X 1≤i ≤N hi si , where the particular problem instance being solved is specified by the N × N matrix J (with elements Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising machines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injection) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is realized. Details are provided in the supplementary materials (26). Fig. 1. Experimental schematic of a measurement-feedback–based coherent Ising machine. A time-division–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear crystal [periodically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction of each pulse is measured and used to compute a feedback signal that effectively couples the otherwise-independent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local oscillator; SHG, second-harmonic generation; FPGA, field-programmable gate array. Graph A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A em Instances 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy f Runs 60 80 100 100% State Energy 58% Exact Solution Exact Solution xact Solution RESEARCH | REPORTS on November 6, 2016 http://science.sciencemag.org/ Downloaded from the M energ Th (cubi Fig. 2 matr Fig. 4. Results with various- size and various-density random graphs. (A) Observed probability of obtaining a solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs Runtime to obtain 99% Success Probability (s) 0 10-4 10-3 10-2 10-1 100 Rand Graph Size N=|V | Success Probability (%) 0 20 40 60 80 100 0 20 40 60 80 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy Random cubic graphs Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 A B C Computation Time (µs) Computation Time (µs) Roundtrip Number 0 50 100 0 100 200 300 Ising Energy -200 -100 0 100 200 300 400 0 50 100 150 MAX CUT / Ground State Energy Graph Cut Size Graph D: |V |=100, |E|=495 D Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy -20 0 20 Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. 0 50 100 150 200 -2 -1 0 1 2 Round trip number In-phase amplitude 0 50 100 150 200 220 240 260 280 300 320 340 Round trip number Cut 616 4 NOVEMBER 2016 • VOL 354 ISSUE 6312 solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs with fixed numbers of edges. Error bars indicate 1 SD. Graph Size N=|V | Success Probability 0 20 40 60 80 0 20 40 60 80 Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 C Computation Time (µs) 1 2 3 Graph Cut Size D Computation Time (μs) Simulation Experiment Simulation Experiment 39 ground state [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

All cubic graphs of N = 16 • MAX-CUT in
cubic graph is NP-hard [E. Halperin, D. Livnat, and U. Zwick (SODA 2002)]. • 4060 instances for n = 16. 100 trials for each instance. 300 round trips for each trial. • Gradual coupling with dξ/dt = -0.001 and pump rate p = 0.88. • Pearson correlation coefﬁcient ρ = 0.80, (cf. simulations between different parameters ρ ~ 0.9) 40 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 80 100 Simulation Experiment 0.0 0.2 0.4 0.6 0.8 1.0 0 100 200 300 400 500 600 700 Success probability Number of Instances Number of instances A B 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 Simulation Experiment #Instances [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Möbius ladder Möbius ladder 41 N = 4n ξ =
-0.02 N = 4n+2 Ê Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ ‡ 5 10 15 20 25 30 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Problem size N Success rate Ground State Probability (%) Moebius Ladder Graphs Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy Number of Runs -20 0 20 0 20 40 60 80 100 N=100 N=52 N=8 100% 48% 21% 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 Graph Size N =|V | MAX CUT / Ground State Energy Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. RESEARCH | REPORTS N = 4n [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] Simulation Experiment

Computation Time Scaling • Time to 99% success (ﬁnd ground
state at least once in 99% probability). • Tcomp x log(1-0.99) / log(1-psuccess) (minimum k for (1-psuccess )k < (1-0.99) is required) • Exponential time to the ground states. Almost constant time to solution ≦ .94 OPT 42 The same spin configuration {si } represents a cut of size CðfsigÞ ¼ −1 2 X 1≤i <j ≤N Jij − 1 2 HðfsigÞ; there is a direct mapping between the ener- gies in the Ising problem and the cut sizes in the Max-Cut problem, and minimizing the Ising energy maximizes the cut. The unweighted, undirected Möbius ladder (cubic) graph with N = 16 vertices is shown in Fig. 2A. We programmed the corresponding J matrix into our feedback electronics and ran Graph Size N=|V | Runtime to obtain 99% Success Probability (s) 0 20 40 60 80 100 10-4 10-3 10-2 10-1 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Random cubic graphs | 80 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy Solution accuracy B Ising Energy 60 -140 -120 21% erved probability of obtaining a the size N of the graph. Multiple dard deviations, which are shown the graphs shown in the insets. on November http://science.sciencemag.org/ Downloaded from ¬ ¬ [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Eﬀect of Edge Density • Lower success rate in dense
graphs are due to insuﬃcient pump (or short computation time) • For the dense graphs, it is easier to modulate the pump: gradual pumping 43 Graph Edges |E| Graph Edge Density d=2|E|/(|V|(|V|-1)) Success Probability (%) 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy 0 2000 4000 0 20 40 60 80 100 0 0.5 1 Simulation Experiment (p = 0.88) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) (p = 0.7) (p = 0.9) (p = 1.1) [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

7. Performance Evaluation with Larger Problems [T. Inagaki, Y. Haribara,
et al., Science 354, 603 (2016)] [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Optimal scheduling for SA • N = 2000 complete graphs
with weight wij = {-1, +1} • The time to SDP solution is ~ 5ms (SA on NII server; Intel Xeon CPU W3530 @ 2.8 GHz, 94 GB Memory). • Temperature scheduling is optimized (to use the different normalization of time) to various energy target. Solution by SDP (-60278) SA (200MCS; avg in 100 trials) 0 200 400 600 800 1000 0.006 0.008 0.010 0.012 0.014 Total step (MCS) Computation time (s) to SDP (-60278) -61000 -61500 -62000 0.00 0.01 0.02 0.03 0.04 -63000 -62000 -61000 -60000 -59000 -58000 Computation time (s) Ising energy SA 45 [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]

G22: random G39: scale-free K2000 : complete Node degree distribution
Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 Cut value 30000 31000 32000 33000 Amount 20 15 10 5 0 Amount 20 15 10 5 0 Cut value 13000 13200 13400 35 30 25 Cut value 2200 2300 2400 Amount 20 15 10 5 0 25 (50 ms) G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 SA 5 ms (Max in 100) 13245 2339 30118 SA 5 ms (Ave. in 100) 13183 2297 29308 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w{0, +1} 2000 nodes, 11778 edges edge weight: w{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w{-1, +1} SA CIM GW-SDP Table 1 Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 SA5msẚ㍑ (5 ms) (5 ms) (5 ms) Performance of solution quality (CUT histogram in 100 trials) 46 SA and SDP : Intel Xeon X5650 @ 2.67 GHz, Westmere (2010) [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]

Table of approaches Deterministic Stochastic Binary neuron* Hopfield Network [Hopfield
1982] Simulated Annealing [Kirkpatrick, et al. 1982] Analog neuron Hopfield-Tank Neural Net [Hopfield&Tank 1986] CIM-cSDE *[McCulloch&Pitts 1943] Comment: Boltzmann Machine is also proposed around the time [Hinton&Sejnowski 1983] but the context is slightly different from optimization problems. 47 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Many-core Processor • PEZY-SC (MIMD architecture) @ 733 MHz •
8192 threads in parallel on chip (1024 cores x 8 threads on chip) • 3 TFlops (single precision) 48 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)] PEZY-SC (1024PE) Prefecture (256PE) City (16PE) Village (4PE) PE Program Counter × 8 L1 Instruction Cache 64bit × 256w (2KB) ALU 4FP ops/cycle Register File 32bit × 256w (1KB) Local Storage 32bit x 4096w (16KB) PE L1 Data Cache 2KB Village (4PE) Village (4PE) Village (4PE) Village (4PE) Special Function Unit City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) DDR4 DRAM 19GB/s PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 L3 Data Cache 2MB DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s ARM 926 ARM 926 L2 Data Cache 64KB L2 Instruction Cache 32KB L3 Instruction Cache 128KB Atomic Cache 128B L1 Data Cache 2KB PE PE PE Thread0 front Thread1 front Thread3 front Thread2 front Thread1 back Thread0 back Thread3 back Thread2 back clock clock clock clock (B) (A)

Result : Time to target 49 Ensemble average in 100
trials (26 for experiment) with best/worst envelope. Note : worst 3 cases of HN didn’t reach the target. Best (ms) Avg (ms) Hardware CIM 0.071 0.264 DOPO+FPGA HN 0.924 1.84 CPU SA 2.10 3.20 CPU HTNN 7.04 9.67 Manycore CIM (experiment) HN SA HTNN 10-5 10-4 0.001 0.010 -30 -25 -20 -15 -10 -5 0 Time (s) Ising energy Compared to: Best Avg HN 13.0141 6.9697 SA 29.5775 12.1212 HTNN 99.1549 36.6288 Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Result : N=20000 50 CIM (simulation) HN SA HTNN 10-5
10-4 0.001 0.010 0.100 1 -100 -80 -60 -40 -20 0 Time (s) Ising energy Best (ms) Avg (ms) Hardware CIM 0.14 0.15 Simulation HN 23 26 CPU SA 60 65 CPU HTNN 500 540 Manycore Compared to: Best Avg HN 164.3 173.3 SA 428.6 433.3 HTNN 3571.4 3600.0 Ensemble average in 100 trials with best/worst envelope. (simulation: c-number Langevin via truncated-Wigner) Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

8. Conclusion • Scalable implementation of measurement feedback CIM is
proposed with pulsed ﬁber DOPO and FPGA • Validation with numerical simulation and experimental implementations of CIM with n = 100 and n = 2048 • In the complete graph of n = 2000, experimental CIM is 7.0 and 12.1 times faster than HN and SA implemented on Intel Xeon CPU 51

Publication list 1. Yoshitaka Haribara, Shoko Utsunomiya, Yoshihisa Yamamoto, Entropy
18, 151 (2016). 2. Kenta Takata, Alireza Marandi, Ryan Hamerly, Yoshitaka Haribara, Daiki Maruo, Shuhei Tamate, Hiromasa Sakaguchi, Shoko Utsunomiya, Yoshihisa Yamamoto, Scientiﬁc Reports 6, 34089 (2016). 3. Takahiro Inagaki, Yoshitaka Haribara, Koji Igarashi, Tomohiro Sonobe, Shuhei Tamate, Toshimori Honjo, Alireza Marandi, Peter L. McMahon, Takeshi Umeki, Koji Enbutsu, Osamu Tadanaga, Hirokazu Takenouchi, Kazuyuki Aihara, Ken-ichi Kawarabayashi, Kyo Inoue, Shoko Utsunomiya, Hiroki Takesue, Science 354, 603 (2016). 4. Peter L. McMahon*, Alireza Marandi*, Yoshitaka Haribara, Ryan Hamerly, Carsten Langrock, Shuhei Tamate, Takahiro Inagaki, Hiroki Takesue, Shoko Utsunomiya, Kazuyuki Aihara, Robert L. Byer, M. M. Fejer, Hideo Mabuchi, Yoshihisa Yamamoto, Science 354, 614 (2016). *equally contributed 5. Yoshitaka Haribara, Shoko Utsunomiya, Ken-ichi Kawarabayashi, Yoshihisa Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017). 6. Yoshitaka Haribara, Hitoshi Ishikawa, Shoko Utsunomiya, Kazuyuki Aihara, Yoshihisa Yamamoto, Quantum Science and Technology 2, 044002 (2017). 52

Appendix

Appendices Index • Appendix A: Slack variable with constraint optimization
• Appendix B: Multimode simulation • Appendix C: Parallel SA implementation on multithreaded CPU • Appendix D: Order-N Monte Carlo • Appendix E: Iterative method • Appendix F: FPGA implementation

Pump rate normalization • The CSDE: • In the numerical
simulation, 3 components are calculated separately. • gain: 10 steps, loss: instantaneous, coupling: instantaneous • The total loss of the cavity consists of 2: out-coupler loss γout and injection-coupler loss γinj . This should be 1 per unit time dt as in CSDE • When dt = .01, the parameters can be set as γout = .009, γinj = .001, i.e., (γout + γinj )/dt = 1. • Suppose the integration step in the gain medium is tg = 10, gain loss coupling 0.5 1.0 1.5 2.0 1 10 100 1000 Pump rate Photon number cf. NTT setup: 8.9 dB loss per round trip (3.3 dB in PPLN)

Lossy cavity (T = 0.25) Random graph g1 (N =
800), As = 400 ξ = -0.4/(avg.degree) = -0.0085 600 RoundTrips × 100 runs p=0.4 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 0.6 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 1.1 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude Amplitude @ each pump rate 1 N ⁄ i=1 N †ci § 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 Pump rate Average amplitude max avg 0.0 0.5 1.0 1.5 11000 11100 11200 11300 11400 11500 11600 Pump rate Number of cut Oscillation threshold #Cut ≦ 11603 11529.6

CUT values to SDP UB SDP UB SG SG3 GW
CIM SA BLS g1 12083 0.90615 0.93925 0.93867 0.95630 0.95862 0.96201 g2 12089 0.91405 0.94251 0.94367 0.95583 0.95790 0.96120 g15 3171 0.90350 0.94387 0.93882 0.94954 0.95427 0.96184 g17 3171 0.90918 0.93914 0.93598 0.94765 0.95080 0.96090 g53 4009 0.90846 0.94138 0.93540 0.94812 0.95161 0.96034 cut values are compared to #E in the original paper SG, SG3 in [Kahruman et al. 05] GW and CIM by Haribara SA by Tamate BLS in [Benlic et al. 12]

Appendix A: Application

Boxed-Constrained Quadratic Programming (BCQP) • Objective function : • BCQP
is NP-hard • Benchmark instances have discrete values of x: • Discrete steps r for variables x : 2 (Ising) ≦ r ≦ 13 • Quadratic coupling matrix : Qij ∈ {0, ±1, ±2(diag)} min f(x) = 1 2 xT Qx s.t. x 2 ⌦ = { 1  x  1 | x 2 Rn} • If Q is positive semide nite (convex) 㱺 solvable in polynomial time • Ellipsoid method • Gradient descent • Interior point method • If Q is negative semide nite or inde nite 㱺 NP-hard • Branch&Bound with SDP lower bounding [S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] • Local methods: [N.I.M. Gould, and P.L. Toint, Appl. Optim. 72, 149 (2002).] • Global methods: [P. De Angelis, et al., Developments in Global Optimization, pp. 73-94 (1997).] [P.M. Pardalos, et al., J. Glob. Optim. 1, 15 (1991); S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] 59

Parameter Sweep • Instance : n=10 spins of r=5 valued
(x = -1, -1/2, 0, 1/2, 1), OPT = -15.25 Parameter search (Best in 100 runs for each parameter set) n=10, #7 (r=5) n=10, #7 (r=5) OPT = -15.25 n=10, #0 (r=5) OPT = -10.25 n=10, #6 (r=13) OPT = -13.58 60

Slack Variable [T. Leleu] PSJHJOBM QBSBNFUFS PQUJNJ[FE TMBDL WBSJBCMFT TVDDFTT
range of parameters 0.6 ≦ p ≦ 1.3 0.06 ≦ ξ ≦ 1.4 dxi = [( 1 + p x2 i ei)xi X j Qijxj] dt + g dWi dei = ei[ 1/2 + IR+ (x xi) + IR+ (xi x+)]dt 61

From BCQP • constraint Σ xi = 1 • mean-variance
portfolio optimization (convex) • max-clique/max-independent set [Motzkin and Straus] (with powerful heuristic solvers) • References: standard quadratic optimization problems [Bomze IM (1998)]; both applications [J Glob Optim 13:369– 387] • more linear constraints • general quadratic programming • below (continuous) & above (binary) threshold OPOs • Mixed-integer programming 62

Truss Topology Optimization [Giger+’06] Ai = {0, 1} Euler’s critical
load Compression, tension, and buckling limit smin i si(Ai)  smax i sEuler i si(Ai) sEuler i = ⇡EiAi 4l2 i  0 Stress of member i • minimize: • subject to: W(A) = m X i=1 ⇢iliAi Material weight density Member length Weight of the structure Equilibrium of forces si = Fi/Ai (Ai = 1) P P 10-member Truss Ground Structure Ai si A Solution

Truss Topology Optimization on CIM W(A) = m X i=1
⇢iliAi • Objective function minimize: • ci = 2Ai - 1: area of the truss member i (Knapsack-type problem) Strategy: 1. ໨తؔ਺ʹ੍໿߲ΛؚΊΔ (buckling constraint) 2. ৳ॖ੍໿ΛεϥοΫม਺ (slack variable) Ͱಋೖ V (c) = m X i=1 ⇢ilici + ↵ K X k=1 f(sE i m X i=1 sikci) Sigmoid Potential dci = [( 1 + p c2 i ei)ci (dV (c)/dci)]dt + gdWi dei = ei[ 1/2 + 1R >0 (s si) + 1R >0 (si s+)]dt

Appendix B: Multimode

4 2 0 2 4 0.6 0.4 0.2 0.0 0.2
0.4 0.6 Longitudinal cavity mode m Ns Amplitude a b c d Normalized amplitude Si t 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time Multimode simulation • Hermite-Gauss modes in free-space cavity

All cubic graphs of n ≦ 16 • Success probability
to ﬁnd the ground state against cubic (3-regular) graphs of order ≦ 16. The parameters are ﬁxed as (p, ξ) = (1.1, -0.1). • Note that the MAX-CUT is still NP-hard against cubic graphs [E. Halperin, et al., 2002] • Number of problems: N 4 6 8 10 12 14 16 #graphs 1 2 5 19 85 509 4060 Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ï Ï Ï Ï Ï Ï Ï Ú Ú Ú Ú Ú Ú Ú 4 6 8 10 12 14 16 0.2 0.4 0.6 0.8 1.0 Graph order N Success rate Success probability Number of vertices N ← Multimode avg. ± SD ← Single-mode avg. ± SD ← Multimode worst case ← Single-mode worst case Note: Up to 5 modes 67

n = 800 sparse graphs • Multimode͸ऑ͍݁ ߹ఆ਺Ͱ΋ͦͦ͜ ͜ͷղΛಘΔ͜ͱ ͕Ͱ͖Δ͕ɺ࠷େ
஋͸্͕Βͳ͍ɻ • Single-mode͸݁߹ ఆ਺ΛߋʹڧΊΔ ͱΑΓྑ͍ղΛಘ ΒΕΔɻ g1 cubic 68

Appendix C: CPU and processor Implementation

0.000 0.005 0.010 0.015 0.020 -32 -31 -30 -29 -28
-27 Time (s) Ising energy Diﬀerence in generation of CPU architecture Note: Intel Core i7-6567U @ 3.3 GHz Skylake (2016) was almost the same as Haswell GW-SDP target Intel Xeon X5650 2.67 GHz Westmere (2010) Intel Xeon E3-1225 v3 3.2 GHz Haswell (2013) (2.5x faster) 70

• Serious problem : the numerical simulation is very slow
when the number of vertices N scales up even when we use light- weight SDEs • Accelerator configurations • CPU : all processes are calculated on CPU • CPU+GPU : matrix-vector product is off-loaded on GPU (float) • PEZY-SC : all calculations are parallelized (float) Bottleneck of SDE O(|E|) ≦ O(N2) in total Other part : O(N) in total 71 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Simulation time scaling • Complete graphs for different sizes •
Simulation time for c-number Langevin eq. (200 round trips) • Spec: • CPU : Intel Xeon W3530, 2.80 GHz, Nehalem-WS (2010) • GPU : NVIDIA Tesla C2075, 1.15 GHz, Fermi (2011) w/cuBLAS • PEZY-SC, 1024 core (x8 threads), 733 MHz (2014) 72 O(N2) ?? depends on FPGA resources CPU CPU+GPU PEZY-SC CIM 500 1000 5000 104 10-4 0.001 0.010 0.100 1 10 100 Problem size N Simulation time (s) CPU → PEZY-SC 250x faster

SAͷฒྻԽ 1. ΞϓϦέʔγϣϯґଘͷฒྻԽ 2. ྖҬ෼ׂ (෼ׂ౷࣏๏) 3. ෳ਺Ϛϧίϑ࿈࠯ [A.M. Ferreiro,
et al., J. Global. Optim. 57, 863 (2012)] ඇಉظ ಉظ sync. sync. sync. sync. (Sync. : ࠷খΤωϧΪʔঢ়ଶΛબͿ)

SAͷGPUʹΑΔฒྻԽ • ؔ਺࠷దԽ: ਖ਼نԽ Schwefelؔ਺ • NVIDIA FeForce GTX 470
• ಉظߋ৽ͷฒྻԽ͸Τ ϥʔݮ • ܭࢉ࣌ؒ 76x [A.M. Ferreiro, et al., J. Global. Optim. 57, 863 (2012)] 876 J Glob Optim (2013) 57:863–890 Table 1 Error of the solution obtained by the algorithm, both in the value of the function at the minimum (columns | fa − fr |, where fa is the objective function value found by the algorithm and fr is the exact function value in the real minimum) and in the minimum (columns relative error, measured in · 2) n V0 V1 V2 | fa − fr | Relative error | fa − fr | Relative error | fa − fr | Relative error 8 1.3190 × 10−1 2.4283 × 10−3 1.2891 × 10−2 7.4675 × 10−4 1.7000 × 10−5 4.1656 × 10−5 16 2.3712 × 10−1 3.2557 × 10−3 7.4586 × 10−2 1.8240 × 10−3 1.9000 × 10−6 5.0686 × 10−5 32 3.3774 × 10−1 3.8852 × 10−3 2.8171 × 10−1 3.5468 × 10−3 1.5730 × 10−4 6.0577 × 10−5 64 7.9651 × 10−1 5.9664 × 10−3 9.7831 × 10−1 6.6126 × 10−3 3.1880 × 10−4 1.2132 × 10−4 128 1.9198 9.2648 × 10−3 3.0461 1.1674 × 10−2 1.2225 × 10−4 1.5304 × 10−4 256 3.6230 1.2733 × 10−2 9.5765 8.0283 × 10−2 1.4953 × 10−2 8.2214 × 10−4 512 7.3054 1.8097 × 10−2 26.2282 4.0424 × 10−1 4.6350 × 10−1 4.5503 × 10−3 Table 2 Performance of CUDA version versus sequential version with one CPU core for different number of parameters n V0 V1 V2 Time Time Speedup Time Speedup 8 1,493.7686 5.5436 269.4595 5.6859 262.7121 16 2,529.3072 15.3942 164.3027 15.5889 162.2502 32 4,618.5820 56.9808 81.0550 60.1882 76.7356 64 8,773.0560 106.6075 82.2930 110.2702 79.5596 128 17,169.0000 210.9499 81.3890 215.5416 79.6552 256 34,251.9240 455.4910 75.1978 462.8035 74.0096 512 68,134.5760 871.7434 78.1589 893.7668 76.2330 (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) ical experiments have been performed with the following hardware and software configurations: a GPU Nvidia Geforce GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz with 16 GB of RAM, CentOS Linux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler 4.1.2. In what follows, we denote by V0 the sequential implementation, by V1 the parallel asynchronous version and by V2 the parallel synchronous one. 4.1 Analysis of a sample test problem: normalized Schwefel function A typical benchmark for testing optimization techniques is the normalized Schwefel function (see [31], for example): f (x x x) = − 1 n n i=1 xi sin |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) For any dimension n, the global minimum is achieved at the point x x x , the coordinates of which are x i = 420.968746, i = 1, . . . , n, and f (x x x ) = −418.982887. Table 1 illustrates the accuracy for the three versions of the SA algorithm: sequential (V0), asynchronous (V1) and synchronous (V2). For these three versions we use the following configuration: T0 = 1000, Tmin = 0.01, N = 100 and ρ = 0.99. For the parallel versions we use the choice b = 256 and g = 64, for the number of threads per block (block size) and the number of blocks per grid (grid size), respectively, so that the number of Markov chains is 16,384. With this configuration, the algorithm performs 1.8776×109 function evaluations in all cases. 123 code, following the ideas of Sect. 2.2, so that both codes ons and their performance can thus be compared. The numer- ormed with the following hardware and software configu- GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz inux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler by V0 the sequential implementation, by V1 the parallel the parallel synchronous one. blem: normalized Schwefel function ptimization techniques is the normalized Schwefel function |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) minimum is achieved at the point x x x , the coordinates of = 1, . . . , n, and f (x x x ) = −418.982887. y for the three versions of the SA algorithm: sequential (V0), nous (V2). For these three versions we use the following = 0.01, N = 100 and ρ = 0.99. For the parallel versions = 64, for the number of threads per block (block size) and rid size), respectively, so that the number of Markov chains n, the algorithm performs 1.8776×109 function evaluations 123 ( x ∈ Rn) Image source : [http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO_files/Page2530.htm]

Speedup for SA (OpenMP) Concurrency #ﬂips time 1 27011.7 0.00530
2 19722.9 0.00437 4 16302.1 0.00448 8 9538.8 0.00408 16 6200.5 0.00838 24 4635.5 0.01014 n = 2000 Intel Xeon E5-2680 @ 2.50 GHz with OpenMP Weighted complete graph K2000 (Jij = ±1) (Courtesy of T. Sonobe & K. Kawarabayashi @ NII) 1.3x 75 n = 105 Concurrency #ﬂips time 1 2218790.8 4.258 2 1358508.9 3.132 4 923020.6 2.479 8 652692.7 2.124 16 533915.5 2.194 24 471558.2 2.527 2.0x

Appendix D: Monte Carlo

Single spin ﬂip O(N) Monte Carlo • Partition function :
• k = (k1, …, kNb) ͷ૊߹ͤͰల։ mean Poisson distribution Inverse temperature Positive parameter =:W 77

SA with single spin ﬂip O(N) MC • Complete graphs
(J=±1-weighted with uniform distribution) • Scaling : O(βεN), while naive MC scales as O(N2) 0 50 100 150 200 250 300 -30 -25 -20 -15 -10 -5 0 Monte Carlo sweeps Ising energy Naive MC (2.0 s) = 2 (24.9 s) = 4 (36.6 s) = 6 (47.3 s) = 8 (59.0 s) = 10 (70.0 s) cf. Constant temperature T = 1.0, ε = 0.1 Simulated Annealing N = 2000, β(t) = 40 log(1+10t), t∈[0,1] Naive Monte Carlo Order-N Monte Carlo 10 100 1000 104 10-6 10-5 10-4 0.001 0.010 0.100 Number of spins N CPU time per MCS (s) 78

Appendix E: Extended Implementation

w12 =2 v1 v2 v3 v4 w34 =3 w23 =1
w24 =3 w14 =-2 w 13=-2 v1 v2 v3 v4 v1 v2 v3 v4 v1 v2 v3 v4 time (in single run) Jij ={0,±1} Weighted graph Temporal graph Weight decomposition weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency Calculate from large (signiﬁcant) weight 80

Variant of iterative method Method coupling Jij CUT magnify 2k
scaling 280.5 2bit-itr single bit 297.4 accumulate or from MSB 370.4 accumulate 2bit-itr bit representation wij = 0 0 1 0 0 1 … 1 1 0 1 0 digit : k … 4 3 2 1 0 ↓ ↓ ↓ ↓ ↓ 2k 2k-4 24 23 2 0 0 1 1 1 1 … 1 1 1 1 1 MSB accumulate bit (or operation): Jij is on if wij > threshold Jij is 2bit (on/oﬀ, +1/-1) 81

Torus with Gaussian dist. • torusg3-8 : 3D toroidal graph
(degree 6) with Gaussian weight: wij ∈ N(0, 1), equivalent to 3D Edwards-Anderson spin glass model • |V| = 512, |E| = 1536, 20-bit weight Implementation Cut SDP UB 457.36 BestKnown 416.85 SA 392.23 continuous Jij ∈ R 393.59 proposed (itr) 385.57 Jij ∈ {-1,0,+1} 354.10 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability t = 800 best continuous accumulate weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency best in 1000 trials 82

torusg3-15 from 7-th DIMACS • 3D toroidal graph (degree 6)
with Gaussian weight: wij ∈ N(0, 1) • |V| = 153 = 3375, |E| = 10125 Implementation Cut SDP UB 3134.60 BestKnown 2602.03 SA 2367.84 continuous Jij ∈ R 2560.43 proposed (itr) 2470.80 Jij ∈ {-1,0,+1} 2214.50 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 10 trials 83

Appendix F: FPGA [patent]

Realization and Evaluation of Measurement Feedb...

Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems

More Decks by Yoshitaka Haribara

Other Decks in Research

Featured

Transcript