Slide 1

Slide 1 text

Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems ૊߹ͤ࠷దԽ໰୊ͷͨΊͷ ଌఆϑΟʔυόοΫܕίώʔϨϯτɾΠδϯάϚγϯ ͷ࣮ݱͱධՁ Yoshitaka Haribara Advisor: Prof. Kazuyuki Aihara January 23, 2018 Department of Mathematical Informatics University of Tokyo

Slide 2

Slide 2 text

Thesis Structure 1.Introduction 2.Maximum Cut Problem 3.Degenerate OPO 4.Coherent Ising Machine 5.Numerical simulation 6.Experiment 1: Results and Validation 7.Experiment 2: Performance Evaluation 8.Conclusion 2 [1,2,5] Ref. [3,4,6]

Slide 3

Slide 3 text

1. Introduction

Slide 4

Slide 4 text

Processor Trends 4 • Uniprocessor performance declined around 2003 (from 52% to 22% / year) • GPU (cf. NVIDIA Tesla V100: 5120 core) • deep learning [1] / scientific simulation (molecular dynamics [2], visualizing CT scan [3], astrophysics many body system [4]) • Field programmable gate array (FPGA; programmable hardware) • Image processing (medical[5], broadcast[6]) / Datacenter (search[7]) [Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp] 1980 1990 2000 2010 1 10 100 1000 104 105 Year #Cores Single-thread performance (SPECint x 103) Freq.(MHz) Power (W) [1] A. Krizhevsky, et al., NIPS 1097 (2012).; Y. LeCun, et al., Nature 521, 436 (2015). [2] A. Götz, et al., Journal of chemical theory and computation 8, 1542 (2012). [3] O. Kutter, et al., Computer Methods and Programs in Biomedicine 94, 250 (2009). [4] L. Nyland, et al., GPU gems 3, 677 (2007). [5] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-medical.pdf [6] https://www.altera.com/en_US/pdfs/literature/wp/wp-brdcst0306.pdf [7] A. Putnam, et al. Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE (2014).

Slide 5

Slide 5 text

Google+UCSB [http://web.physics.ucsb.edu/~martinisgroup/ photos/Linear9XmonSurfaceCode.jpeg] Traffic flow in Beijing (from Ref. [6]) TPU (Google) [https://cloud.google.com/blog/big-data/ 2017/05/an-in-depth-look-at-googles- first-tensor-processing-unit-tpu] Problem Specific Processors • Application specific IC (ASIC) • TPU (1st gen. for prediction) in neural network [1,2] • 8 bit quantization to reduce the number of multiplier • Matrix (systolic array) and nonlinear activation unit • (Universal) Quantum Computer (n ≧ 49 [3]) • Quantum simulation for Fermionic structures in chemistry and material science [4] • Quantum Annealer (Ising problem) • Machine learning to find Higgs Boson [5] • Traffic flow optimization [6] 5 [1] https://cloud.google.com/blog/big-data/2017/05/an-in- depth-look-at-googles-first-tensor-processing-unit-tpu [2] D. Silver, et al., Nature 550, 354 (2017). [3] J. Kerry, et al., Nature 519, 66 (2015) for n = 9. [4] J.R. McClean, et al., arXiv:1710.07629 [quant-ph] (2017). [5] A Mott, et al., Nature 550, 375 (2017). [6] F. Neukart, et al., arXiv:1708.01625 [quant-ph] (2017). D-Wave [https://www.dwavesys.com/ resources/media-resources]

Slide 6

Slide 6 text

Complexity Classes NP P = NP = NP-complete P NP-complete NP-hard NP-hard P ≠ NP P = NP Computational Complexity • Ising problem (NP-hard) • minimize: • Quantum Annealers [1] • D-Wave (n = 2000, chimera graph) [2] / NEDO • Ising Processor • Hitachi (n = 20,000 CMOS and FPGA) [3] / Fujitsu (FPGA) [4] Ising machines 6 1k-spin sub-array 780 × 380 μm2 4 mm 3 mm 1k spin circuits SRAM I/F Fig. 5—Ising Chip Photograph. The chip has 20,000 spin circuits in a 3-mm × 4-mm = 12-mm2 area. I/F: interface Hitachi [http://www.hitachi.com/rev/ pdf/2016/r2016_06_110.pdf] H = X i6=j Jij i j, i 2 { 1, 1} (i = 1, . . . , n) HQA = X i6=j Jij z i z j (t) n 1 X i=0 x i Ising Transverse field [1] T. Kadowaki and H. Nishimori, Phys. Rev. E 58, 5355 (1998). [2] https://www.dwavesys.com/sites/default/files/D- Wave%202000Q%20Tech%20Collateral_0117F2.pdf [3] M. Yamaoka, et al., IEEE Journal of Solid-State Circuits 51, 303; C. Yoshimura, et al., IEEE CANDAR 436. [4] S. Matsubara, et al., In Conference on Complex, Intelligent, and Software Intensive Systems, 432. K4,4 in chimera [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]

Slide 7

Slide 7 text

N = 4, 16 CIM [A. Marandi, Z. Wang, K. Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientific Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 7 Optical Delay Lines

Slide 8

Slide 8 text

My Contribution • Proposed a scalable architecture of coherent Ising machine with DOPO and FPGA, which enabled to implement arbitrary topology of graphs. • Mathematical modeling with CSDE and numerical simulations with reasonable parameters which led to the experimental success • Benchmark on MAX-CUT instances up to n ≦ 2,000 (experiment) and n ≦ 20,000 (simulation) against existing algorithms

Slide 9

Slide 9 text

2. Maximum Cut Problems

Slide 10

Slide 10 text

CUT = 4 Maximum Cut (MAX-CUT) • Given a graph G(V, E) with weight wij. • Find the largest cut in G. • Cut is a partition of V into 2 subsets. • #Cut = total weight of crossing edges. • The term (1 - σi σj)/2 only counts the crossing edge, where σ = {+1, -1}. • Equivalent to anti-ferromagnetic Ising problem (minimizing Σ wij σ i σ j ). +1 +1 -1 -1 10 [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]

Slide 11

Slide 11 text

Algorithms for MAX-CUT -0.87856: Semi-definite Programming (SDP) relaxation [Goemans and Williamson, 1995] -0.53112: Variant of Spectral partitioning (minimum eigenvalue) [Trevisan, 2008] -0.5 + ε (ε > 0): Random Walk [Kale, et al., 2010] -0.5: Greedy [Sahni and Gonzalez, 1976] Approx. ratio -Breakout Local Search [Benlic, et al., 2013] -Simulated Annealing [Kirkpatrick, et al., 1983] -SG3 (Variant of Greedy) [Kahruman, et al., 2007] UGC [2] 0.94117 PCP [1] NP-hard P Planar [Orlova and Dorfman, 1972; Hadlock, 1975], Weakly bipartite [Grötschel, et al., 1981], Positive weighted without long odd cycle [Grötschel, et al., 1984], Integer weight (≦ N) with fixed genus [Galluccio, et al., 2001] Graphs solvable in P Heuristics 11 [1] S. Arora and S. Safra., JACM 45, 70 (1998). [2] S. Khot, STOC 767 (2002).

Slide 12

Slide 12 text

SDP relaxation • A continuous-valued relaxation problem of MAX-CUT: • The matrix becomes Positive Semi-Definite. • The optimal value of this reduction problem (class P) is called SDP Upper Bound. • A solution for the original MAX-CUT is obtained by random projections. Expectation values > 0.87856 of OPT (MAX-CUT). vi vj [M. X. Goemans and D. P. Williamson, J. ACM 42, 1115 (1995)] 12

Slide 13

Slide 13 text

Simulated Annealing (SA) • Meta-heuristic : gradually decreasing the “temperature”, i.e., noise, then converge to the ground state in infinite time [Geman and Geman, IEEE PAMI 6, 721 (1984)] • Performance is heavily dependent on : temperature scheduling (log), data structure (vector or array), multi-spin coding (below) 0 1 0 0 0 1 … 1 1 0 1 0 1 0 1 0 1 0 … 1 0 0 1 1 ↓ bitwise XOR 1 1 1 0 1 1 … 0 1 0 0 1 64-bit multi-spin coding spin σj weight wij (wij=±1 complete graph) → bit count (SIMD instruction) where wij ^ σj (Courtesy of S. Tamate & T. Sonobe @ NII) 13 [Kirkpatrick, et al., Science 220, 671 (1982)] Pflip = ⇢ 1 ( E  0) exp( E) ( E > 0)

Slide 14

Slide 14 text

BLS: Breakout Local Search • Combination of local search & perturbation • 3 types of perturbation: uni-directional, bi-directional, random • Constant time spin update with bucket data structure 1. Local search (steepest descent) 2. Perturbation (forced spin flipping) P 1-P Q 1 - Q → ⁶ random Perturbation 14

Slide 15

Slide 15 text

Backet Data V1 V2 V3 V4 V5 V6 Cut 1 -1 1 -1 1 1 -1 1 -1 ɾ ɾ ɾ ɾ ɾ ɾ 1 2 3 4 5 6 gmax : 3 2 1 0 -1 -2 -3 : gmin gmax : 3 2 1 0 -1 -2 -3 : gmin S1 S2 S1 S2 • Gain list for each side of the partition and vertex address list enable constant O(1) time spin configuration update. 6 4 5 2 1 3 maxgain maxgain 15

Slide 16

Slide 16 text

Greedy Algorithm and its Variant • Input : A graph with edge weights wij. • output : A cut {S1, S2} and the cut value cut(S1, S2). 1. ∀i ∈ V \{S1 ∪ S2 } , score(i ) = max{ w(i, S1 ), w(i, S2 )} SG3: score(i ) = |w(i, S1 ) - w(i, S2 )| 2. cut += score(imax) x y S1 S2 ∀i ∈ V’ x y S1 S2 ∀i ∈ V’ imax “best in” [S. Kahruman, et al., Int. J. Comp. Sci. and Eng. 3, 211 (2005)]

Slide 17

Slide 17 text

Neural Network based Optimization Algorithms • Hopfield Network (HN) [J. J. Hopfield, PNAS 79, 2554 (1982)] • Update rule : majority vote: • Derandomized SA • Hopfield-Tank NN (HTNN) [Hopfield and Tank, Science 233, 625 (1986)] • Parallel implementation with a many-core processor sigmoid sgn 0 @ X j Jijxj bj 1 A

Slide 18

Slide 18 text

QUBO-MILP • MAX-CUT → UBQP/QUBO → IBM CPLEX (MILP). Mixed- integer programming with a linear objective function. • zij = xi xj when xi ,xj ∈ {0, 1}, and define the convex hull of {(xi , xj , xixj ) | xi ,xj ∈ {0, 1} }. (Rooted Correlation Semimetric Polytope) • LP-based branch&cut minimize : X i

Slide 19

Slide 19 text

3. Degenerate OPO

Slide 20

Slide 20 text

Optical Parametric Oscillator (OPO) • In a degenerate optical parametric oscillator (DOPO), a nonlinear crystal is placed in a cavity (mirrors). • Frequency down conversion via molecular vibration occurs. The input with frequency ωp (pump) is converted to ωs (signal). • Degeneracy: • OPO can be used as a physical random number generator with quantum noise [A. Marandi, et al., Optics express 20, 19322 (2012)] [C. Nabors, et al., JOSA B 7, 815 (1990)] frequency phase Self-phase-locking 20 ( ) Pump Signal Nonlinear crystal Pump laser

Slide 21

Slide 21 text

Formulation of a DOPO • System Hamiltonian H ↓ Liouville von-Neumann Equation • Quantum Master Equation for density matrix ρ ↓ Phase Space Expansion • Fokker-Planck Equation for Wigner quasi-distribution W(α) ↓ Ito Formula • Langevin Equation: phase space variable α 21 i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵

Slide 22

Slide 22 text

Pump Field System Hamiltonian Signal Field Parametric Coupling External Pumping Irreversible Interaction with the Reservoir where H = Hfree + Hinteraction + Hpump + Hsys-res Hfree = ~!sˆ a† s ˆ as + ~!pˆ a† p ˆ ap Hinteraction = i~  2 (ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s ) Hpump = i~("ˆ a† p "⇤ˆ ap) Hsys-res = ~(ˆ as ˆ† Rs + ˆ Rsˆ a† s + ˆ ap ˆ† Rp + ˆ Rpˆ a† p ) ( ) Pump Signal 22 : photon creation operator : photon annihilation operator ˆ a ˆ a† [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]

Slide 23

Slide 23 text

• Using the Liouville von-Neumann Equation • Master equation for the density matrix ρ • Born and Markov approximation, Adiabatic elimination • No analytic solution in the κ ≠ 0 case Master Equation i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] 23 @ˆ ⇢ @t = i!s[ˆ a† s ˆ as, ˆ ⇢] + i  2 [ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s , ˆ ⇢] + s 2 (2ˆ as ˆ ⇢ˆ a† s ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ a† s ˆ as) + s¯ ns(ˆ as ˆ ⇢ˆ a† s + ˆ a† s ˆ ⇢ˆ as ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ asˆ a† s ). [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]

Slide 24

Slide 24 text

Fokker-Planck Equation • Expanded with Wigner quasi-distribution W(α) • Fokker-Planck Eq. for a single DOPO in truncated Wigner representation • Note: • Truncation (dropping the cubic derivatives) is valid in the small quantum noise limit • Wigner representation can describe the nonclassical field @W @t = ⇣ 2 + i! ⌘ @ @↵ ↵ + ⇣ 2 i! ⌘ @ @↵⇤ ↵⇤ + (¯ n + 1 2 ) @2 @↵@↵⇤ W 24 [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵

Slide 25

Slide 25 text

• Complex amplitude (c + is) ∈ is described as the SDE: Normalized pump rate p (pth = 1), Steady state amplitude As Langevin Equation Squeezed vacuum Phase Sensitive Amplification Squeezed states Pitchfork bifurcation Independent Gaussian noise (μ=0, σ=1) loss saturated gain [P. Kinsler and P. D. Drummond, Phys. Rev. A. 43, 6194 (1991)] 25

Slide 26

Slide 26 text

Ising problem Coherent Ising Machine (CIM) Objective function Whole loss of DOPO network Spin variable DOPO phase Adjacency matrix Coupling constant DOPO network • Feedback (injection) term: • Effective potential : detected amplitude : coupling constant [Z. Wang, A. Marandi, K. Wen, R. L. Byer, and Y. Yamamoto, Phys. Rev. A 88, 063853 (2013)] -1.0 -0.5 0.5 1.0 -0.004 -0.002 0.002 26

Slide 27

Slide 27 text

4. Coherent Ising Machine [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)]

Slide 28

Slide 28 text

N = 4, 16 CIM [A. Marandi, Z. Wang, K. Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientific Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 28 Optical Delay Lines

Slide 29

Slide 29 text

Measurement Feedback CIM (proposed) • To stabilize the O(N) delay lines is experimentally hard for large N • Multiple DOPO pulses in a fiber is coupled by a FPGA circuit • Stochastic differential equation for each DOPO signal amplitude: SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity discretize in time 29 gain I/O (without noise) Time evolution g(c) = c p 1 + c2 c(t) = s 1 1 + exp( t) [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] dci = [( 1 + p c2 i s2 i )ci + N X j=1 ⇠ij ˜ cj] dt + 1 As r c2 i + s2 i + 1 2 dWci

Slide 30

Slide 30 text

5. Numerical Simulations [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y. Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)]

Slide 31

Slide 31 text

Working Equations ˜ ci = cout i p Tmes = ci r 1 Tmes Tmes fi As Measured Value SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity 0 20 40 60 80 100 1.0 0.5 0.0 0.5 1.0 Normalized time Number of round trips DOPO signal amplitude ci 33 34 35 36 0.80 0.85 0.90 0.95 1.00 1.05 1.10 A C B A B C [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y. Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)] A B C dci = [(p c2 i s2 i )ci] dt + 1 As r c2 i + s2 i + 1 2 dWci dsi = [( p c2 i s2 i )si] dt + 1 As r c2 i + s2 i + 1 2 dWsi ci(t + t) 7! p 1 Tmesci(t) + p Tmes fi As si(t + t) 7! p 1 Tmessi(t) + p Tmes fi As ci(t + t) 7! p 1 Tinjci(t) + p Tinj⇠ n X j=1 Jij ˜ cj si(t + t) 7! p 1 Tinjsi(t)

Slide 32

Slide 32 text

Parameter sweep (g14) Mean Best 0.2 0.4 0.6 0.8 1.0 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant 2850 2870 2890 2910 0.2 0.4 0.6 0.8 1.0 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant 2890 2910 2930 2950 100 trials for each parameter

Slide 33

Slide 33 text

Parameter sweep (K100) Mean Best 100 trials for each parameter

Slide 34

Slide 34 text

Simulation results g22, g39 in G-set • g22, |V| = 2000, |E| = 19990 (density 1%), random graph • Gradually pumped from p = 0.9 to p = 1.3 linearly. • Coupling constant ξ = -0.02/√ = 0.004 • g39, |V| = 2000, |E| = 11778 (density 0.59%), scale-free with ±1 weight • Gradual pump (linear: from p = 0.9 to 1.3) • Coupling constant ξ = -0.05/√ = 0.0042 CUT Ratio SDP UB 14136 — BestKnown 13359 0.945 SA 13351 0.944 Best 13081 0.925 Mean 12997.99 0.919 GW-SDP 12993 0.919 CUT Ratio SDP UB 2877 — BestKnown 2408 0.946 SA 2379 0.943 Best 2273 0.931 Mean 2228.69 0.926 GW-SDP 2200 0.923

Slide 35

Slide 35 text

Results for G-set (sparse) • For the random graphs in G-set, the number of edges m are independent from n, which results in the constant scaling for SA as O(|E|) • GW (PD relative gap 10-3), best in 100 trials of SA and CIM with hysteretic optimization (estimated time) on Intel Xeon X5650 (2.67 GHz), 94 GB RAM. • MILP solved by CPLEX-MILP (mixed integer linear programming) [Ikuta, et al., 2015], BLS [Benlic, et al., 2013] CIM SA GW 800 2000 5000 7000 10000 0.001 0.1 10 1000 105 Problem size N Computation time s 35 Topology GW-SDP n roundings MILP CIM w/ hysteretic SA BLS Comp. time 2 s - 105 s 12 s - 105 s 0.05 s 0.05 s 1 s - 104 s Random 0.9273 0.9116 0.9448 0.9471 0.9491 Scale-free 0.9288 0.9269 0.9451 0.9510 0.9547 Toroidal 0.9375 0.9618 0.9511 0.9548 0.9616 All 0.9303 0.9349 0.9464 0.9502 0.9540 (Random graph)

Slide 36

Slide 36 text

Computation time scaling on complete graphs • GW : interior point method for solving SDP → O(N 3.5) • SA : Constant Monte Carlo steps × avg. degree → O(N 2) • CIM : Optical time-scale of DOPO turn-on delay time () CIM (simulation) SA GW SG3 BLS 40 160 800 4000 20000 10-4 0.01 1 100 104 106 Problem size N Computation time (s) () 200 m fiber 2 km fiber 20 km fiber 2 GHz clock 20 GHz clock 200 GHz clock 100 1000 104 105 106 107 10-5 10-4 0.001 0.010 0.100 1 Problem size N Computation time (s) Graph order N Graph order N Cavity length Pulse repetition Note for SA: N = 105 㱺 3 sec for 20 MCS (multi-core CPU)

Slide 37

Slide 37 text

6. Experimental Results and Validation [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Slide 38

Slide 38 text

Machine configurations 38 Stanford NTT N Number of pulses 100 (+60 for stabilization) 2048 (+2834 for stabilization) Round trip time (Fiber length) 1.6 μs (320 m) 5 μs (1 km) Pulse repetition freq. (Pulse interval) 100 MHz (= 10 ns) 1 GHz (= 1 ns) FPGA Xilinx Virtex-6 (SX475T) 2x Xilinx Virtex-7 (VX690T) Connection arbitrary topology with Jij = {+1, 0, -1} Current operational scheme Gradual coupling Gradual pumping [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)] SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity Chap. 6 Chap. 7

Slide 39

Slide 39 text

Single Trial N = 16 Cubic graph (Möbius ladder) N = 100 Random graph (Erdös-Rényi) and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). imental schematic of a measurement-feedback–based coherent Ising machine. n–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear dically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction e is measured and used to compute a feedback signal that effectively couples the ependent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local G, second-harmonic generation; FPGA, field-programmable gate array. A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A er of Problem Instances 400 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy Graph A 100% 38% Graph C 58% Graph B Exact Solution Exact Solution Exact Solution on November 6, 2 http://science.sciencemag.org/ Downloaded from 0 50 100 150 200 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 Round trip number in-phase amplitude 0 50 100 150 200 0 5 10 15 20 Round trip number Cut ments Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising ma- chines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). ack–based coherent Ising machine. etric oscillator is formed by a nonlinear cavity containing 160 pulses. A fraction ck signal that effectively couples the ulator; PM, phase modulator; LO, local ammable gate array. Roundtrip Number 50 100 150 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 mputation Time (µs) Computation Time (µs) 50 100 150 200 ph A Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy on November 6, 2016 http://science.sciencemag.org/ Downloaded from between qubits remains a major challenge (15), with important implications for the efficiency of AQC/QA systems (16). Networks of coupled optical parametric oscil- lators (OPOs) are an alternative physical system, with an unconventional operating mech- anism (17–20), for solving the Ising problem (21, 22) and by extension many other com- binatorial optimization problems (23). For- mally, the N-spin Ising problem is to find the configuration of spins si ∈ f−1; þ1g (i = 1, ..., N) that minimizes the energy function H ¼ − X 1≤i < j ≤N Jij si sj − X 1≤i ≤N hi si , where the par- ticular problem instance being solved is specified by the N × N matrix J (with ele- ments Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising ma- chines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). Fig. 1. Experimental schematic of a measurement-feedback–based coherent Ising machine. A time-division–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear crystal [periodically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction of each pulse is measured and used to compute a feedback signal that effectively couples the otherwise-independent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local oscillator; SHG, second-harmonic generation; FPGA, field-programmable gate array. Graph A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A em Instances 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy f Runs 60 80 100 100% State Energy 58% Exact Solution Exact Solution xact Solution RESEARCH | REPORTS on November 6, 2016 http://science.sciencemag.org/ Downloaded from the M energ Th (cubi Fig. 2 matr Fig. 4. Results with various- size and various-density random graphs. (A) Observed probability of obtaining a solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs Runtime to obtain 99% Success Probability (s) 0 10-4 10-3 10-2 10-1 100 Rand Graph Size N=|V | Success Probability (%) 0 20 40 60 80 100 0 20 40 60 80 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy Random cubic graphs Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 A B C Computation Time (µs) Computation Time (µs) Roundtrip Number 0 50 100 0 100 200 300 Ising Energy -200 -100 0 100 200 300 400 0 50 100 150 MAX CUT / Ground State Energy Graph Cut Size Graph D: |V |=100, |E|=495 D Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy -20 0 20 Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. 0 50 100 150 200 -2 -1 0 1 2 Round trip number In-phase amplitude 0 50 100 150 200 220 240 260 280 300 320 340 Round trip number Cut 616 4 NOVEMBER 2016 • VOL 354 ISSUE 6312 solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs with fixed numbers of edges. Error bars indicate 1 SD. Graph Size N=|V | Success Probability 0 20 40 60 80 0 20 40 60 80 Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 C Computation Time (µs) 1 2 3 Graph Cut Size D Computation Time (μs) Simulation Experiment Simulation Experiment 39 ground state [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Slide 40

Slide 40 text

All cubic graphs of N = 16 • MAX-CUT in cubic graph is NP-hard [E. Halperin, D. Livnat, and U. Zwick (SODA 2002)]. • 4060 instances for n = 16. 100 trials for each instance. 300 round trips for each trial. • Gradual coupling with dξ/dt = -0.001 and pump rate p = 0.88. • Pearson correlation coefficient ρ = 0.80, (cf. simulations between different parameters ρ ~ 0.9) 40 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 80 100 Simulation Experiment 0.0 0.2 0.4 0.6 0.8 1.0 0 100 200 300 400 500 600 700 Success probability Number of Instances Number of instances A B 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 Simulation Experiment #Instances [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Slide 41

Slide 41 text

Möbius ladder Möbius ladder 41 N = 4n ξ = -0.02 N = 4n+2 Ê Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ ‡ 5 10 15 20 25 30 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Problem size N Success rate Ground State Probability (%) Moebius Ladder Graphs Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy Number of Runs -20 0 20 0 20 40 60 80 100 N=100 N=52 N=8 100% 48% 21% 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 Graph Size N =|V | MAX CUT / Ground State Energy Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. RESEARCH | REPORTS N = 4n [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] Simulation Experiment

Slide 42

Slide 42 text

Computation Time Scaling • Time to 99% success (find ground state at least once in 99% probability). • Tcomp x log(1-0.99) / log(1-psuccess) (minimum k for (1-psuccess )k < (1-0.99) is required) • Exponential time to the ground states. Almost constant time to solution ≦ .94 OPT 42 The same spin configuration {si } represents a cut of size CðfsigÞ ¼ −1 2 X 1≤i

Slide 43

Slide 43 text

Effect of Edge Density • Lower success rate in dense graphs are due to insufficient pump (or short computation time) • For the dense graphs, it is easier to modulate the pump: gradual pumping 43 Graph Edges |E| Graph Edge Density d=2|E|/(|V|(|V|-1)) Success Probability (%) 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy 0 2000 4000 0 20 40 60 80 100 0 0.5 1 Simulation Experiment (p = 0.88) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) (p = 0.7) (p = 0.9) (p = 1.1) [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]

Slide 44

Slide 44 text

7. Performance Evaluation with Larger Problems [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)] [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Slide 45

Slide 45 text

Optimal scheduling for SA • N = 2000 complete graphs with weight wij = {-1, +1} • The time to SDP solution is ~ 5ms (SA on NII server; Intel Xeon CPU W3530 @ 2.8 GHz, 94 GB Memory). • Temperature scheduling is optimized (to use the different normalization of time) to various energy target. Solution by SDP (-60278) SA (200MCS; avg in 100 trials) 0 200 400 600 800 1000 0.006 0.008 0.010 0.012 0.014 Total step (MCS) Computation time (s) to SDP (-60278) -61000 -61500 -62000 0.00 0.01 0.02 0.03 0.04 -63000 -62000 -61000 -60000 -59000 -58000 Computation time (s) Ising energy SA 45 [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]

Slide 46

Slide 46 text

G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 Cut value 30000 31000 32000 33000 Amount 20 15 10 5 0 Amount 20 15 10 5 0 Cut value 13000 13200 13400 35 30 25 Cut value 2200 2300 2400 Amount 20 15 10 5 0 25 (50 ms) G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 SA 5 ms (Max in 100) 13245 2339 30118 SA 5 ms (Ave. in 100) 13183 2297 29308 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w{0, +1} 2000 nodes, 11778 edges edge weight: w{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w{-1, +1} SA CIM GW-SDP Table 1 Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 SA5msẚ㍑ (5 ms) (5 ms) (5 ms) Performance of solution quality (CUT histogram in 100 trials) 46 SA and SDP : Intel Xeon X5650 @ 2.67 GHz, Westmere (2010) [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]

Slide 47

Slide 47 text

Table of approaches Deterministic Stochastic Binary neuron* Hopfield Network [Hopfield 1982] Simulated Annealing [Kirkpatrick, et al. 1982] Analog neuron Hopfield-Tank Neural Net [Hopfield&Tank 1986] CIM-cSDE *[McCulloch&Pitts 1943] Comment: Boltzmann Machine is also proposed around the time [Hinton&Sejnowski 1983] but the context is slightly different from optimization problems. 47 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Slide 48

Slide 48 text

Many-core Processor • PEZY-SC (MIMD architecture) @ 733 MHz • 8192 threads in parallel on chip (1024 cores x 8 threads on chip) • 3 TFlops (single precision) 48 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)] PEZY-SC (1024PE) Prefecture (256PE) City (16PE) Village (4PE) PE Program Counter × 8 L1 Instruction Cache 64bit × 256w (2KB) ALU 4FP ops/cycle Register File 32bit × 256w (1KB) Local Storage 32bit x 4096w (16KB) PE L1 Data Cache 2KB Village (4PE) Village (4PE) Village (4PE) Village (4PE) Special Function Unit City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) DDR4 DRAM 19GB/s PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 L3 Data Cache 2MB DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s ARM 926 ARM 926 L2 Data Cache 64KB L2 Instruction Cache 32KB L3 Instruction Cache 128KB Atomic Cache 128B L1 Data Cache 2KB PE PE PE Thread0 front Thread1 front Thread3 front Thread2 front Thread1 back Thread0 back Thread3 back Thread2 back clock clock clock clock (B) (A)

Slide 49

Slide 49 text

Result : Time to target 49 Ensemble average in 100 trials (26 for experiment) with best/worst envelope. Note : worst 3 cases of HN didn’t reach the target. Best (ms) Avg (ms) Hardware CIM 0.071 0.264 DOPO+FPGA HN 0.924 1.84 CPU SA 2.10 3.20 CPU HTNN 7.04 9.67 Manycore CIM (experiment) HN SA HTNN 10-5 10-4 0.001 0.010 -30 -25 -20 -15 -10 -5 0 Time (s) Ising energy Compared to: Best Avg HN 13.0141 6.9697 SA 29.5775 12.1212 HTNN 99.1549 36.6288 Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Slide 50

Slide 50 text

Result : N=20000 50 CIM (simulation) HN SA HTNN 10-5 10-4 0.001 0.010 0.100 1 -100 -80 -60 -40 -20 0 Time (s) Ising energy Best (ms) Avg (ms) Hardware CIM 0.14 0.15 Simulation HN 23 26 CPU SA 60 65 CPU HTNN 500 540 Manycore Compared to: Best Avg HN 164.3 173.3 SA 428.6 433.3 HTNN 3571.4 3600.0 Ensemble average in 100 trials with best/worst envelope. (simulation: c-number Langevin via truncated-Wigner) Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Slide 51

Slide 51 text

8. Conclusion • Scalable implementation of measurement feedback CIM is proposed with pulsed fiber DOPO and FPGA • Validation with numerical simulation and experimental implementations of CIM with n = 100 and n = 2048 • In the complete graph of n = 2000, experimental CIM is 7.0 and 12.1 times faster than HN and SA implemented on Intel Xeon CPU 51

Slide 52

Slide 52 text

Publication list 1. Yoshitaka Haribara, Shoko Utsunomiya, Yoshihisa Yamamoto, Entropy 18, 151 (2016). 2. Kenta Takata, Alireza Marandi, Ryan Hamerly, Yoshitaka Haribara, Daiki Maruo, Shuhei Tamate, Hiromasa Sakaguchi, Shoko Utsunomiya, Yoshihisa Yamamoto, Scientific Reports 6, 34089 (2016). 3. Takahiro Inagaki, Yoshitaka Haribara, Koji Igarashi, Tomohiro Sonobe, Shuhei Tamate, Toshimori Honjo, Alireza Marandi, Peter L. McMahon, Takeshi Umeki, Koji Enbutsu, Osamu Tadanaga, Hirokazu Takenouchi, Kazuyuki Aihara, Ken-ichi Kawarabayashi, Kyo Inoue, Shoko Utsunomiya, Hiroki Takesue, Science 354, 603 (2016). 4. Peter L. McMahon*, Alireza Marandi*, Yoshitaka Haribara, Ryan Hamerly, Carsten Langrock, Shuhei Tamate, Takahiro Inagaki, Hiroki Takesue, Shoko Utsunomiya, Kazuyuki Aihara, Robert L. Byer, M. M. Fejer, Hideo Mabuchi, Yoshihisa Yamamoto, Science 354, 614 (2016). *equally contributed 5. Yoshitaka Haribara, Shoko Utsunomiya, Ken-ichi Kawarabayashi, Yoshihisa Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017). 6. Yoshitaka Haribara, Hitoshi Ishikawa, Shoko Utsunomiya, Kazuyuki Aihara, Yoshihisa Yamamoto, Quantum Science and Technology 2, 044002 (2017). 52

Slide 53

Slide 53 text

Appendix

Slide 54

Slide 54 text

Appendices Index • Appendix A: Slack variable with constraint optimization • Appendix B: Multimode simulation • Appendix C: Parallel SA implementation on multithreaded CPU • Appendix D: Order-N Monte Carlo • Appendix E: Iterative method • Appendix F: FPGA implementation

Slide 55

Slide 55 text

Pump rate normalization • The CSDE: • In the numerical simulation, 3 components are calculated separately. • gain: 10 steps, loss: instantaneous, coupling: instantaneous • The total loss of the cavity consists of 2: out-coupler loss γout and injection-coupler loss γinj . This should be 1 per unit time dt as in CSDE • When dt = .01, the parameters can be set as γout = .009, γinj = .001, i.e., (γout + γinj )/dt = 1. • Suppose the integration step in the gain medium is tg = 10, gain loss coupling 0.5 1.0 1.5 2.0 1 10 100 1000 Pump rate Photon number cf. NTT setup: 8.9 dB loss per round trip (3.3 dB in PPLN)

Slide 56

Slide 56 text

Lossy cavity (T = 0.25) Random graph g1 (N = 800), As = 400 ξ = -0.4/(avg.degree) = -0.0085 600 RoundTrips × 100 runs p=0.4 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 0.6 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 1.1 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude Amplitude @ each pump rate 1 N ⁄ i=1 N †ci § 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 Pump rate Average amplitude max avg 0.0 0.5 1.0 1.5 11000 11100 11200 11300 11400 11500 11600 Pump rate Number of cut Oscillation threshold #Cut ≦ 11603 11529.6

Slide 57

Slide 57 text

CUT values to SDP UB SDP UB SG SG3 GW CIM SA BLS g1 12083 0.90615 0.93925 0.93867 0.95630 0.95862 0.96201 g2 12089 0.91405 0.94251 0.94367 0.95583 0.95790 0.96120 g15 3171 0.90350 0.94387 0.93882 0.94954 0.95427 0.96184 g17 3171 0.90918 0.93914 0.93598 0.94765 0.95080 0.96090 g53 4009 0.90846 0.94138 0.93540 0.94812 0.95161 0.96034 cut values are compared to #E in the original paper SG, SG3 in [Kahruman et al. 05] GW and CIM by Haribara SA by Tamate BLS in [Benlic et al. 12]

Slide 58

Slide 58 text

Appendix A: Application

Slide 59

Slide 59 text

Boxed-Constrained Quadratic Programming (BCQP) • Objective function : • BCQP is NP-hard • Benchmark instances have discrete values of x: • Discrete steps r for variables x : 2 (Ising) ≦ r ≦ 13 • Quadratic coupling matrix : Qij ∈ {0, ±1, ±2(diag)} min f(x) = 1 2 xT Qx s.t. x 2 ⌦ = { 1  x  1 | x 2 Rn} • If Q is positive semide nite (convex) 㱺 solvable in polynomial time • Ellipsoid method • Gradient descent • Interior point method • If Q is negative semide nite or inde nite 㱺 NP-hard • Branch&Bound with SDP lower bounding [S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] • Local methods: [N.I.M. Gould, and P.L. Toint, Appl. Optim. 72, 149 (2002).] • Global methods: [P. De Angelis, et al., Developments in Global Optimization, pp. 73-94 (1997).] [P.M. Pardalos, et al., J. Glob. Optim. 1, 15 (1991); S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] 59

Slide 60

Slide 60 text

Parameter Sweep • Instance : n=10 spins of r=5 valued (x = -1, -1/2, 0, 1/2, 1), OPT = -15.25 Parameter search (Best in 100 runs for each parameter set) n=10, #7 (r=5) n=10, #7 (r=5) OPT = -15.25 n=10, #0 (r=5) OPT = -10.25 n=10, #6 (r=13) OPT = -13.58 60

Slide 61

Slide 61 text

Slack Variable [T. Leleu] PSJHJOBM QBSBNFUFS PQUJNJ[FE TMBDL WBSJBCMFT TVDDFTT range of parameters 0.6 ≦ p ≦ 1.3 0.06 ≦ ξ ≦ 1.4 dxi = [( 1 + p x2 i ei)xi X j Qijxj] dt + g dWi dei = ei[ 1/2 + IR+ (x xi) + IR+ (xi x+)]dt 61

Slide 62

Slide 62 text

From BCQP • constraint Σ xi = 1 • mean-variance portfolio optimization (convex) • max-clique/max-independent set [Motzkin and Straus] (with powerful heuristic solvers) • References: standard quadratic optimization problems [Bomze IM (1998)]; both applications [J Glob Optim 13:369– 387] • more linear constraints • general quadratic programming • below (continuous) & above (binary) threshold OPOs • Mixed-integer programming 62

Slide 63

Slide 63 text

Truss Topology Optimization [Giger+’06] Ai = {0, 1} Euler’s critical load Compression, tension, and buckling limit smin i si(Ai)  smax i sEuler i si(Ai) sEuler i = ⇡EiAi 4l2 i  0 Stress of member i • minimize: • subject to: W(A) = m X i=1 ⇢iliAi Material weight density Member length Weight of the structure Equilibrium of forces si = Fi/Ai (Ai = 1) P P 10-member Truss Ground Structure Ai si A Solution

Slide 64

Slide 64 text

Truss Topology Optimization on CIM W(A) = m X i=1 ⇢iliAi • Objective function minimize: • ci = 2Ai - 1: area of the truss member i (Knapsack-type problem) Strategy: 1. ໨తؔ਺ʹ੍໿߲ΛؚΊΔ (buckling constraint) 2. ৳ॖ੍໿ΛεϥοΫม਺ (slack variable) Ͱಋೖ V (c) = m X i=1 ⇢ilici + ↵ K X k=1 f(sE i m X i=1 sikci) Sigmoid Potential dci = [( 1 + p c2 i ei)ci (dV (c)/dci)]dt + gdWi dei = ei[ 1/2 + 1R >0 (s si) + 1R >0 (si s+)]dt

Slide 65

Slide 65 text

Appendix B: Multimode

Slide 66

Slide 66 text

4 2 0 2 4 0.6 0.4 0.2 0.0 0.2 0.4 0.6 Longitudinal cavity mode m Ns Amplitude a b c d Normalized amplitude Si t 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time Multimode simulation • Hermite-Gauss modes in free-space cavity

Slide 67

Slide 67 text

All cubic graphs of n ≦ 16 • Success probability to find the ground state against cubic (3-regular) graphs of order ≦ 16. The parameters are fixed as (p, ξ) = (1.1, -0.1). • Note that the MAX-CUT is still NP-hard against cubic graphs [E. Halperin, et al., 2002] • Number of problems: N 4 6 8 10 12 14 16 #graphs 1 2 5 19 85 509 4060 Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ï Ï Ï Ï Ï Ï Ï Ú Ú Ú Ú Ú Ú Ú 4 6 8 10 12 14 16 0.2 0.4 0.6 0.8 1.0 Graph order N Success rate Success probability Number of vertices N ← Multimode avg. ± SD ← Single-mode avg. ± SD ← Multimode worst case ← Single-mode worst case Note: Up to 5 modes 67

Slide 68

Slide 68 text

n = 800 sparse graphs • Multimode͸ऑ͍݁ ߹ఆ਺Ͱ΋ͦͦ͜ ͜ͷղΛಘΔ͜ͱ ͕Ͱ͖Δ͕ɺ࠷େ ஋͸্͕Βͳ͍ɻ • Single-mode͸݁߹ ఆ਺ΛߋʹڧΊΔ ͱΑΓྑ͍ղΛಘ ΒΕΔɻ g1 cubic 68

Slide 69

Slide 69 text

Appendix C: CPU and processor Implementation

Slide 70

Slide 70 text

0.000 0.005 0.010 0.015 0.020 -32 -31 -30 -29 -28 -27 Time (s) Ising energy Difference in generation of CPU architecture Note: Intel Core i7-6567U @ 3.3 GHz Skylake (2016) was almost the same as Haswell GW-SDP target Intel Xeon X5650 2.67 GHz Westmere (2010) Intel Xeon E3-1225 v3 3.2 GHz Haswell (2013) (2.5x faster) 70

Slide 71

Slide 71 text

• Serious problem : the numerical simulation is very slow when the number of vertices N scales up even when we use light- weight SDEs • Accelerator configurations • CPU : all processes are calculated on CPU • CPU+GPU : matrix-vector product is off-loaded on GPU (float) • PEZY-SC : all calculations are parallelized (float) Bottleneck of SDE O(|E|) ≦ O(N2) in total Other part : O(N) in total 71 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]

Slide 72

Slide 72 text

Simulation time scaling • Complete graphs for different sizes • Simulation time for c-number Langevin eq. (200 round trips) • Spec: • CPU : Intel Xeon W3530, 2.80 GHz, Nehalem-WS (2010) • GPU : NVIDIA Tesla C2075, 1.15 GHz, Fermi (2011) w/cuBLAS • PEZY-SC, 1024 core (x8 threads), 733 MHz (2014) 72 O(N2) ?? depends on FPGA resources CPU CPU+GPU PEZY-SC CIM 500 1000 5000 104 10-4 0.001 0.010 0.100 1 10 100 Problem size N Simulation time (s) CPU → PEZY-SC 250x faster

Slide 73

Slide 73 text

SAͷฒྻԽ 1. ΞϓϦέʔγϣϯґଘͷฒྻԽ 2. ྖҬ෼ׂ (෼ׂ౷࣏๏) 3. ෳ਺Ϛϧίϑ࿈࠯ [A.M. Ferreiro, et al., J. Global. Optim. 57, 863 (2012)] ඇಉظ ಉظ sync. sync. sync. sync. (Sync. : ࠷খΤωϧΪʔঢ়ଶΛબͿ)

Slide 74

Slide 74 text

SAͷGPUʹΑΔฒྻԽ • ؔ਺࠷దԽ: ਖ਼نԽ Schwefelؔ਺ • NVIDIA FeForce GTX 470 • ಉظߋ৽ͷฒྻԽ͸Τ ϥʔݮ • ܭࢉ࣌ؒ 76x [A.M. Ferreiro, et al., J. Global. Optim. 57, 863 (2012)] 876 J Glob Optim (2013) 57:863–890 Table 1 Error of the solution obtained by the algorithm, both in the value of the function at the minimum (columns | fa − fr |, where fa is the objective function value found by the algorithm and fr is the exact function value in the real minimum) and in the minimum (columns relative error, measured in · 2) n V0 V1 V2 | fa − fr | Relative error | fa − fr | Relative error | fa − fr | Relative error 8 1.3190 × 10−1 2.4283 × 10−3 1.2891 × 10−2 7.4675 × 10−4 1.7000 × 10−5 4.1656 × 10−5 16 2.3712 × 10−1 3.2557 × 10−3 7.4586 × 10−2 1.8240 × 10−3 1.9000 × 10−6 5.0686 × 10−5 32 3.3774 × 10−1 3.8852 × 10−3 2.8171 × 10−1 3.5468 × 10−3 1.5730 × 10−4 6.0577 × 10−5 64 7.9651 × 10−1 5.9664 × 10−3 9.7831 × 10−1 6.6126 × 10−3 3.1880 × 10−4 1.2132 × 10−4 128 1.9198 9.2648 × 10−3 3.0461 1.1674 × 10−2 1.2225 × 10−4 1.5304 × 10−4 256 3.6230 1.2733 × 10−2 9.5765 8.0283 × 10−2 1.4953 × 10−2 8.2214 × 10−4 512 7.3054 1.8097 × 10−2 26.2282 4.0424 × 10−1 4.6350 × 10−1 4.5503 × 10−3 Table 2 Performance of CUDA version versus sequential version with one CPU core for different number of parameters n V0 V1 V2 Time Time Speedup Time Speedup 8 1,493.7686 5.5436 269.4595 5.6859 262.7121 16 2,529.3072 15.3942 164.3027 15.5889 162.2502 32 4,618.5820 56.9808 81.0550 60.1882 76.7356 64 8,773.0560 106.6075 82.2930 110.2702 79.5596 128 17,169.0000 210.9499 81.3890 215.5416 79.6552 256 34,251.9240 455.4910 75.1978 462.8035 74.0096 512 68,134.5760 871.7434 78.1589 893.7668 76.2330 (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) ical experiments have been performed with the following hardware and software configu- rations: a GPU Nvidia Geforce GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz with 16 GB of RAM, CentOS Linux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler 4.1.2. In what follows, we denote by V0 the sequential implementation, by V1 the parallel asynchronous version and by V2 the parallel synchronous one. 4.1 Analysis of a sample test problem: normalized Schwefel function A typical benchmark for testing optimization techniques is the normalized Schwefel function (see [31], for example): f (x x x) = − 1 n n i=1 xi sin |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) For any dimension n, the global minimum is achieved at the point x x x , the coordinates of which are x i = 420.968746, i = 1, . . . , n, and f (x x x ) = −418.982887. Table 1 illustrates the accuracy for the three versions of the SA algorithm: sequential (V0), asynchronous (V1) and synchronous (V2). For these three versions we use the following configuration: T0 = 1000, Tmin = 0.01, N = 100 and ρ = 0.99. For the parallel versions we use the choice b = 256 and g = 64, for the number of threads per block (block size) and the number of blocks per grid (grid size), respectively, so that the number of Markov chains is 16,384. With this configuration, the algorithm performs 1.8776×109 function evaluations in all cases. 123 code, following the ideas of Sect. 2.2, so that both codes ons and their performance can thus be compared. The numer- ormed with the following hardware and software configu- GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz inux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler by V0 the sequential implementation, by V1 the parallel the parallel synchronous one. blem: normalized Schwefel function ptimization techniques is the normalized Schwefel function |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) minimum is achieved at the point x x x , the coordinates of = 1, . . . , n, and f (x x x ) = −418.982887. y for the three versions of the SA algorithm: sequential (V0), nous (V2). For these three versions we use the following = 0.01, N = 100 and ρ = 0.99. For the parallel versions = 64, for the number of threads per block (block size) and rid size), respectively, so that the number of Markov chains n, the algorithm performs 1.8776×109 function evaluations 123 ( x ∈ Rn) Image source : [http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO_files/Page2530.htm]

Slide 75

Slide 75 text

Speedup for SA (OpenMP) Concurrency #flips time 1 27011.7 0.00530 2 19722.9 0.00437 4 16302.1 0.00448 8 9538.8 0.00408 16 6200.5 0.00838 24 4635.5 0.01014 n = 2000 Intel Xeon E5-2680 @ 2.50 GHz with OpenMP Weighted complete graph K2000 (Jij = ±1) (Courtesy of T. Sonobe & K. Kawarabayashi @ NII) 1.3x 75 n = 105 Concurrency #flips time 1 2218790.8 4.258 2 1358508.9 3.132 4 923020.6 2.479 8 652692.7 2.124 16 533915.5 2.194 24 471558.2 2.527 2.0x

Slide 76

Slide 76 text

Appendix D: Monte Carlo

Slide 77

Slide 77 text

Single spin flip O(N) Monte Carlo • Partition function : • k = (k1, …, kNb) ͷ૊߹ͤͰల։ mean Poisson distribution Inverse temperature Positive parameter =:W 77

Slide 78

Slide 78 text

SA with single spin flip O(N) MC • Complete graphs (J=±1-weighted with uniform distribution) • Scaling : O(βεN), while naive MC scales as O(N2) 0 50 100 150 200 250 300 -30 -25 -20 -15 -10 -5 0 Monte Carlo sweeps Ising energy Naive MC (2.0 s) = 2 (24.9 s) = 4 (36.6 s) = 6 (47.3 s) = 8 (59.0 s) = 10 (70.0 s) cf. Constant temperature T = 1.0, ε = 0.1 Simulated Annealing N = 2000, β(t) = 40 log(1+10t), t∈[0,1] Naive Monte Carlo Order-N Monte Carlo 10 100 1000 104 10-6 10-5 10-4 0.001 0.010 0.100 Number of spins N CPU time per MCS (s) 78

Slide 79

Slide 79 text

Appendix E: Extended Implementation

Slide 80

Slide 80 text

w12 =2 v1 v2 v3 v4 w34 =3 w23 =1 w24 =3 w14 =-2 w 13=-2 v1 v2 v3 v4 v1 v2 v3 v4 v1 v2 v3 v4 time (in single run) Jij ={0,±1} Weighted graph Temporal graph Weight decomposition weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency Calculate from large (significant) weight 80

Slide 81

Slide 81 text

Variant of iterative method Method coupling Jij CUT magnify 2k scaling 280.5 2bit-itr single bit 297.4 accumulate or from MSB 370.4 accumulate 2bit-itr bit representation wij = 0 0 1 0 0 1 … 1 1 0 1 0 digit : k … 4 3 2 1 0 ↓ ↓ ↓ ↓ ↓ 2k 2k-4 24 23 2 0 0 1 1 1 1 … 1 1 1 1 1 MSB accumulate bit (or operation): Jij is on if wij > threshold Jij is 2bit (on/off, +1/-1) 81

Slide 82

Slide 82 text

Torus with Gaussian dist. • torusg3-8 : 3D toroidal graph (degree 6) with Gaussian weight: wij ∈ N(0, 1), equivalent to 3D Edwards-Anderson spin glass model • |V| = 512, |E| = 1536, 20-bit weight Implementation Cut SDP UB 457.36 BestKnown 416.85 SA 392.23 continuous Jij ∈ R 393.59 proposed (itr) 385.57 Jij ∈ {-1,0,+1} 354.10 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability t = 800 best continuous accumulate weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency best in 1000 trials 82

Slide 83

Slide 83 text

torusg3-15 from 7-th DIMACS • 3D toroidal graph (degree 6) with Gaussian weight: wij ∈ N(0, 1) • |V| = 153 = 3375, |E| = 10125 Implementation Cut SDP UB 3134.60 BestKnown 2602.03 SA 2367.84 continuous Jij ∈ R 2560.43 proposed (itr) 2470.80 Jij ∈ {-1,0,+1} 2214.50 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 10 trials 83

Slide 84

Slide 84 text

Appendix F: FPGA [patent]