Slide 1

Computational Results
Stanley Osher and Wuchen Li et al.
UCLA and University of South Carolina

Slide 2

Understanding score-based generative models via the lens of Wasserstein proximal operators
Stanley Osher, Siting Liu (University of California, Los Angeles)
Benjamin Zhang, Markos Katsoulakis (University of Massachusetts, Amherst)
Wuchen Li (University of South Carolina)
August 2024

Slide 3

Fundamental mathematical nature of score-based generative models

- A fundamental characterization of score-based generative models (SGMs) as Wasserstein proximal operators (WPO) of cross-entropy
- Mean-field games build a bridge to mathematically equivalent alternative formulations of SGMs
- Yields explainable formulations of SGMs grounded in theories of information, optimal transport, manifold learning, and optimization
- Uncovering the mathematical structure of SGMs explains memorization and informs practical models that generalize better; it suggests new practical models with interpretable, mathematically informed structure that train faster with less data

Slide 4

Wasserstein proximal operator

- The Wasserstein metric, also known as the optimal transport (OT) distance, is a distance function between probability distributions, denoted $W_p(\mu, \nu)$.
- For $p = 2$, $\tau > 0$, and a given probability density $\mu$, the Wasserstein proximal operator (WPO) of a function $V(x)$ is defined as
  $$\rho := \mathrm{WProx}_{\tau V}(\mu) := \arg\min_{q \in \mathcal{P}_2(\mathbb{R}^d)} \Big\{ \int_{\mathbb{R}^d} V(x)\, q(x)\, dx + \frac{W_2(\mu, q)^2}{2\tau} \Big\}$$
- Setting $V(x) = -\log \pi(x)$ for some reference distribution $\pi$, the first term is the cross-entropy between $q$ and $\pi$:
  $$\rho = \arg\min_{q} \Big\{ -\int_{\mathbb{R}^d} q \log \pi \, dx + \frac{W_2(\mu, q)^2}{2\tau} \Big\}$$
  $\mu$ (source) $\to$ $\pi$ (target): redistribution + transport
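For a potential energy $q \mapsto \int V \, dq$ with convex $V$, the Wasserstein proximal acts particle-wise: it pushes $\mu$ forward under the pointwise Euclidean proximal map $x \mapsto \arg\min_y V(y) + |x-y|^2/(2\tau)$. A minimal sketch; the quadratic potential $V(y) = y^2/2$, for which the proximal map is $x \mapsto x/(1+\tau)$, is an illustrative choice not taken from the slide:

```python
import numpy as np

# For a potential energy q -> int V dq with convex V, the Wasserstein
# proximal acts particle-wise: it pushes mu forward under the Euclidean
# proximal map  x -> argmin_y V(y) + |x - y|^2 / (2 tau).
# For the illustrative choice V(y) = y^2/2 this map is x -> x / (1 + tau).
def wprox_potential_quadratic(X, tau):
    return X / (1.0 + tau)

rng = np.random.default_rng(0)
X = rng.normal(size=2000)          # particles representing mu = N(0,1)
tau = 0.5
Y = wprox_potential_quadratic(X, tau)
print(Y.std() / X.std())           # contracted by exactly 1/(1+tau)
```

Each particle moves a step of gradient descent on $V$ regularized by the transport cost, which is the "redistribution + transport" picture above in particle form.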

Slide 5

Regularized Wasserstein proximal operator

- Following the Benamou-Brenier formulation of OT, the dynamic formulation of the Wasserstein proximal operator is a first-order mean-field game system.
- The regularization adds a viscosity term $\beta \Delta \rho$ to the dynamic formulation of optimal transport:
  $$\inf_{\rho, v} \Big\{ \int_0^1 \int_\Omega \frac{1}{2} \rho(x,t) \|v(x,t)\|^2 \, dx\, dt + \int_\Omega V(x) \rho(x,1)\, dx \Big\}$$
  $$\text{s.t.} \quad \partial_t \rho + \nabla \cdot (\rho v(x,t)) = \beta \Delta \rho, \quad \rho(x,0) = \mu(x)$$
- Regularized WPO: $\rho := \mathrm{WProx}_{\tau V, \beta}(\mu) := \arg\min_q \big\{ \int_{\mathbb{R}^d} V(x) q(x)\, dx + \frac{W_2(\mu,q)^2}{2\tau} \big\}$, with the viscous dynamic constraint above.
- With regularization, we obtain a closed-form formula for $\rho$, which allows fast computation.^1
- The KKT conditions of this PDE-constrained optimization problem give a second-order mean-field game system:
  $$\partial_t U + \tfrac{1}{2} |\nabla U|^2 = -\beta \Delta U, \quad U(x,1) = -V(x)$$
  $$\partial_t \rho + \nabla \cdot (\rho \nabla U) = \beta \Delta \rho, \quad \rho(x,0) = \mu(x)$$

^1 Wuchen Li, Siting Liu, and Stanley Osher, "A kernel formula for regularized Wasserstein proximal operators." Research in the Mathematical Sciences

Slide 6

Deriving SGM from the regularized WPO

- Score-based generative models (SGMs) first perturb data with an SDE, then reverse the SDE for sample generation, where the reversal is guided by the score function $s(x,t) = \nabla_x \log p(x,t)$.
  - Forward SDE (data $\to$ noise): $dX(t) = \sigma \, dW(t)$
  - Reverse SDE (noise $\to$ data): $dY(t) = \sigma^2 s \, dt + \sigma \, d\bar{W}(t)$
- Set $V(x) = -\log \pi(x)$, $\beta = \sigma^2/2$ in the regularized WPO, and apply the Cole-Hopf transform with the time reparametrization $U(x,t) = \sigma^2 \log \eta(x, T-t)$:
  $$\partial_s \eta = \tfrac{\sigma^2}{2} \Delta \eta, \quad \eta(y, 0) = \pi(y)$$
  $$\partial_t \rho + \nabla \cdot \big( \rho \, \sigma^2 \nabla \log \eta \big) = \tfrac{\sigma^2}{2} \Delta \rho, \quad \rho(x,0) = \mu(x)$$
- SGM can be summarized as the composition of the forward and inverse WPO:
  $$\pi \approx \tilde{\pi} = \mathrm{WProx}_{\sigma^2 T H, \sigma^2/2} \big( \mathrm{WProx}^{-1}_{\sigma^2 T H, \sigma^2/2}(\hat{\pi}) \big),$$
  where $\hat{\pi}$ is the empirical distribution defined by samples $\{X_i\}_{i=1}^N \sim \pi$.
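The forward/reverse structure can be seen in closed form when the data distribution is Gaussian: for $\pi = N(0,1)$, the forward SDE gives $p(x,t) = N(0, 1+\sigma^2 t)$ with score $s(x,t) = -x/(1+\sigma^2 t)$. A small sketch; the Gaussian $\pi$ and the deterministic probability-flow version of the reverse dynamics are illustrative choices, not the slide's setup:

```python
import numpy as np

sigma, T, n = 1.0, 1.0, 2000
dt = T / n

def score(x, t):
    # Exact score of p(x,t) = N(0, 1 + sigma^2 t), the law of the forward
    # SDE started from pi = N(0,1) (an illustrative data distribution).
    return -x / (1.0 + sigma**2 * t)

# Deterministic probability-flow counterpart of the reverse dynamics,
# dx/dt = -(sigma^2/2) s(x,t), integrated from t = T back to t = 0.
x_T = np.array([2.0, -1.0, 0.5])      # "noise" samples at time T
x, t = x_T.copy(), T
for _ in range(n):
    x = x - dt * (-(sigma**2 / 2.0) * score(x, t))   # reverse Euler step
    t -= dt

# Exact reverse map in this Gaussian case: rescaling by 1/sqrt(1 + sigma^2 T)
exact = x_T / np.sqrt(1.0 + sigma**2 * T)
print(np.max(np.abs(x - exact)))      # small O(dt) discretization error
```

The reverse integration contracts the noisy samples back onto the data scale, which is exactly the "noise $\to$ data" direction above.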

Slide 7

Kernel representation for the score and memorization

For the initial condition $\eta(x,0) = \pi(x)$, the score function $s(x,t) = \nabla \log \eta(x,t)$ has the kernel representation formula
$$s(x,t) = \sigma^{-2} \nabla U(x,t) = \frac{(\nabla G_{\sigma^2/2,\, T-t} * \pi)(x)}{(G_{\sigma^2/2,\, T-t} * \pi)(x)},$$
where $G$ is the heat kernel.
- Memorization: in the forward and inverse Wasserstein proximal notation, using this empirical score formula is equivalent to applying the relation
  $$\hat{\pi} = \mathrm{WProx}_{\sigma^2 T H, \sigma^2/2} \big( \mathrm{WProx}^{-1}_{\sigma^2 T H, \sigma^2/2}(\hat{\pi}) \big).$$
  That is, using the exact empirical score function simply returns the training data.
- Generalization: consider a generalization of the empirical distribution by Gaussian kernels,
  $$\hat{\pi}_\theta(x; \{Z_i\}_{i=1}^N) = \frac{1}{N} \sum_{i=1}^N \frac{\det(\Lambda_\theta(Z_i))^{1/2}}{(2\pi)^{d/2}} \exp\Big( -\frac{(x - Z_i)^\top \Lambda_\theta(Z_i)\, (x - Z_i)}{2} \Big).$$
- Learn the local precision matrix $\Lambda_\theta(\cdot)$ near each kernel center using neural networks. Learning the local precision matrices is manifold learning.
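With $\pi$ replaced by the empirical distribution $\hat{\pi}$, the kernel representation reduces to a softmax over the data: $s(x,t) = \sum_i w_i(x)(X_i - x)/\nu$ with kernel variance $\nu$ (standing in for $\sigma^2(T-t)$) and $w_i \propto \exp(-|x - X_i|^2/(2\nu))$. A 1D sketch, with names and the scalar setting as illustrative assumptions:

```python
import numpy as np

def empirical_score(x, data, nu):
    """Score s(x) = (grad G * pi_hat)(x) / (G * pi_hat)(x) for the empirical
    distribution pi_hat of `data`, with a Gaussian kernel of variance nu
    (standing in for sigma^2 (T - t); 1D, names are illustrative).
    Written in softmax form to avoid overflow."""
    d2 = (x - data)**2
    w = np.exp(-(d2 - d2.min()) / (2.0 * nu))
    w /= w.sum()
    return float(np.sum(w * (data - x)) / nu)

data = np.array([-1.0, 1.0])
print(empirical_score(0.0, data, 0.5))   # exactly between the two points: 0
# As nu -> 0 the weights concentrate on the nearest data point, so the
# reverse dynamics are driven straight back to the training set (memorization):
print(empirical_score(0.9, data, 1e-3))  # ~ (1.0 - 0.9)/1e-3 = 100
```

The second evaluation makes the memorization mechanism concrete: for small $\nu$ the empirical score points directly at the nearest training sample with magnitude growing like $1/\nu$.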

Slide 8

A kernel model that generalizes

- Enforce the terminal condition of the HJ equation, which is equivalent to implicit score matching.
- Learning the local covariance matrices is akin to manifold learning, something SGMs have been empirically observed to do.
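The kernel model from the previous slide is just a mixture of Gaussians with one local precision matrix per center, so evaluating it is a few lines. A sketch; the fixed example precisions stand in for the neural-network output $\Lambda_\theta(\cdot)$ described on the slide:

```python
import numpy as np

def kernel_density(x, centers, precisions):
    """Gaussian-kernel generalization of the empirical distribution:
       (1/N) sum_i det(P_i)^(1/2) / (2 pi)^(d/2) exp(-(x-Z_i)^T P_i (x-Z_i)/2),
    with one local precision matrix P_i per center Z_i.  Here the P_i are
    fixed inputs; on the slide they are produced by a neural network."""
    N, d = centers.shape
    total = 0.0
    for Z, P in zip(centers, precisions):
        r = x - Z
        total += np.sqrt(np.linalg.det(P)) / (2.0*np.pi)**(d/2) * np.exp(-0.5 * r @ P @ r)
    return total / N

centers = np.array([[0.0, 0.0], [2.0, 0.0]])
precisions = [np.eye(2), np.diag([4.0, 1.0])]   # illustrative local precisions
print(kernel_density(np.zeros(2), centers, precisions))
```

An anisotropic precision (here `diag([4.0, 1.0])`) narrows the kernel along one direction, which is how the model encodes the local tangent structure of a data manifold.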

Slide 9

Examples: Deconstructing SGM

Slide 10

Examples: Deconstructing SGM

The 6D example highlights the method's effectiveness in moderate dimensions, showcasing a 3D Swiss roll noisily embedded in 6D space.

Slide 11

Example: learning the data manifold

Example with the two-moons dataset.
- Red ellipses denote the local covariance matrices.
- The set of local covariance matrices defines a Riemannian metric, and therefore a manifold.

Slide 12

Takeaways

- Faster training with less data due to the mathematically informed structure of the kernel model, resolving memorization.
  - Proper choice of kernel (solves the HJB equation).
  - Manifold learning (terminal condition of the HJB equation; proximal interpretation).
- Requires no simulation of SDEs: the kernel model can be sampled from directly.
- The formulation provides new ideas for implementations:
  - New bespoke neural networks for score-based models for scalable implementations.
  - Tensors instead of neural networks in manifold learning.

Slide 13

Tensor Train Based Sampling Algorithms for Approximating Regularized Wasserstein Proximal Operators
Fuqun Han (UCLA), Stanley Osher (UCLA), Wuchen Li (USC)
August 15, 2024
arXiv:2401.13125

Slide 14

Introduction

- Objective: obtain samples from a known probability distribution
  $$\rho^*(x) = \frac{1}{Z} \exp(-\beta V(x)).$$
- The classical overdamped Langevin dynamics for sampling is
  $$dX(t) = -\nabla V(X(t))\, dt + \sqrt{2/\beta}\, dW(t),$$
  where $W(t)$ is the standard Wiener process (Brownian motion).
- We consider the score-function-based particle evolution equation
  $$\frac{dX}{dt} = -\nabla V(X) - \beta^{-1} \nabla \log \rho(t, X).$$
- To approximate the score function, we utilize the regularized Wasserstein proximal operator (RWPO)
  $$\rho_T = \arg\min_q \inf_{v, \rho} \int_0^T \int_{\mathbb{R}^d} \frac{1}{2} \|v(t,x)\|^2 \rho(t,x)\, dx\, dt + \int_{\mathbb{R}^d} V(x) q(x)\, dx,$$
  $$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho v) = \beta^{-1} \Delta \rho, \quad \rho(0,x) = \rho_0(x), \quad \rho(T,x) = q(x).$$
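The baseline Langevin dynamics above can be sketched in a few lines; the quadratic potential $V(x) = x^2/2$ (so $\rho^* = N(0, 1/\beta)$) is an illustrative choice:

```python
import numpy as np

beta, h = 1.0, 0.01
rng = np.random.default_rng(0)

def grad_V(x):
    return x               # V(x) = x^2/2, so rho* = N(0, 1/beta) (illustrative)

# Unadjusted Langevin algorithm (ULA): Euler-Maruyama discretization of
# dX = -grad V(X) dt + sqrt(2/beta) dW, run for a population of particles.
x = rng.normal(size=20000)
for _ in range(500):
    x = x - h * grad_V(x) + np.sqrt(2.0 * h / beta) * rng.normal(size=x.size)
print(x.var())             # ~1, the variance of rho*, up to O(h) bias
```

This is the ULA baseline the TT-BRWP experiments later compare against; the score-based particle evolution replaces the Brownian increment with the deterministic term $-\beta^{-1} \nabla \log \rho$.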

Slide 15

Kernel Formula to Approximate the Score Function^1

- Introducing a Lagrange multiplier, the RWPO is equivalent to
  $$\partial_t \rho + \nabla_x \cdot (\rho \nabla_x \phi) = \beta^{-1} \Delta_x \rho,$$
  $$\partial_t \phi + \tfrac{1}{2} \|\nabla_x \phi\|^2 = -\beta^{-1} \Delta_x \phi,$$
  $$\rho(0,x) = \rho_0(x), \quad \phi(T,x) = -V(x),$$
  which couples a Fokker-Planck equation with a Hamilton-Jacobi equation.
- With the Hopf-Cole transformation, the RWPO can be solved in closed form:
  $$\rho_T(x) = \int_{\mathbb{R}^d} \frac{\exp\big[ -\frac{\beta}{2}\big( V(x) + \frac{\|x-y\|^2}{2T} \big) \big]}{\int_{\mathbb{R}^d} \exp\big[ -\frac{\beta}{2}\big( V(z) + \frac{\|z-y\|^2}{2T} \big) \big]\, dz}\, \rho_0(y)\, dy,$$
  where $\rho_T$ approximates the evolution of the Fokker-Planck equation
  $$\frac{\partial \rho}{\partial t} = \nabla \cdot (\rho \nabla V) + \beta^{-1} \Delta \rho, \quad \rho(0,x) = \rho_0(x).$$

^1 W. Li, S. Liu, S. Osher, A kernel formula for regularized Wasserstein proximal operators, 2023
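The kernel formula can be evaluated directly in 1D by quadrature on a grid. A sketch; the grid, the quadratic potential, and the Gaussian $\rho_0 = N(2, 0.5^2)$ are illustrative choices:

```python
import numpy as np

beta, T = 1.0, 0.1
z = np.linspace(-8.0, 8.0, 801)   # 1D quadrature grid (illustrative)
dz = z[1] - z[0]

def V(x):
    return 0.5 * x**2             # illustrative potential

# rho_0 = N(2, 0.5^2), represented on the grid
rho0 = np.exp(-0.5 * (z - 2.0)**2 / 0.25) / np.sqrt(2.0 * np.pi * 0.25)

# Kernel K(x, y) = exp(-beta/2 (V(x) + |x-y|^2/(2T)))
#               / int exp(-beta/2 (V(z') + |z'-y|^2/(2T))) dz'
num = np.exp(-0.5 * beta * (V(z)[:, None] + (z[:, None] - z[None, :])**2 / (2.0 * T)))
K = num / (num.sum(axis=0) * dz)  # normalize each column over its first argument

rhoT = K @ rho0 * dz              # rho_T(x) = int K(x, y) rho_0(y) dy
mass = rhoT.sum() * dz            # K transports mass, so this stays ~1
mean_T = (z * rhoT).sum() * dz    # pulled from 2 toward argmin V = 0
print(mass, mean_T)
```

The column normalization of `K` guarantees mass conservation, and the drift of the mean toward the minimizer of $V$ shows the Fokker-Planck behavior the formula approximates.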

Slide 16

Algorithm

- With the density function from the kernel formula, the score function at time $t_k + h$ is approximated by
  $$\nabla \log \rho_{t_k + h} = \frac{\nabla \rho_{t_k + h}}{\rho_{t_k + h}}.$$
- The score function at time $t_k + h$ allows us to implement a semi-implicit Euler discretization of the score dynamics:
  $$X_{k+1} = X_k - h \big( \nabla V(X_k) + \beta^{-1} \nabla \log \rho_{t_k + h}(X_k) \big).$$
- The high-dimensional functions in the kernel formula (the density function and the heat kernel) are approximated by a tensor train,
  $$f(x_1, \ldots, x_d) \approx \sum_{\alpha_1 = 1}^{r_1} \cdots \sum_{\alpha_{d-1} = 1}^{r_{d-1}} f_1(x_1, \alpha_1)\, f_2(\alpha_1, x_2, \alpha_2) \cdots f_d(\alpha_{d-1}, x_d).$$
  The computational complexity of evaluating the kernel formula is then linear in the dimension $d$.
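The tensor-train evaluation above amounts to $d-1$ small matrix-vector products. A self-contained sketch with random cores (sizes and ranks are illustrative), checked against the fully assembled tensor:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 6, 5, 3      # dimensions, grid points per dim, TT rank (illustrative)

# TT cores: first core (n, r); middle cores (r, n, r); last core (r, n).
cores = ([rng.normal(size=(n, r))]
         + [rng.normal(size=(r, n, r)) for _ in range(d - 2)]
         + [rng.normal(size=(r, n))])

def tt_eval(idx):
    """Evaluate f(x_{i1}, ..., x_{id}) from its TT cores by d-1 small
    matrix-vector products: cost linear in d, versus n^d for the full tensor."""
    v = cores[0][idx[0]]                 # shape (r,)
    for k in range(1, d - 1):
        v = v @ cores[k][:, idx[k], :]   # (r,) @ (r, r)
    return float(v @ cores[-1][:, idx[-1]])

# Check against the fully assembled tensor on this small example.
full = cores[0]
for k in range(1, d - 1):
    full = np.tensordot(full, cores[k], axes=([-1], [0]))
full = np.tensordot(full, cores[-1], axes=([-1], [0]))
idx = (0, 1, 2, 3, 4, 0)
print(abs(tt_eval(idx) - full[idx]))   # agreement up to round-off
```

In the sampling algorithm this contraction is what keeps each evaluation of the kernel-formula density linear in $d$.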

Slide 17

Example I: Sampling from a Non-Gaussian Distribution

We consider a sampling task involving a bimodal distribution:
$$\rho^*(x) = \frac{1}{Z} \exp(-2(\|x\| - a)^2) \big[ \exp(-2(x_1 - a)^2) + \exp(-2(x_1 + a)^2) \big].$$
Figure: Evolution of the sample points for the different algorithms (TT-BRWP and ULA) after 30 iterations. The contour levels are 0.8, 0.4, and 0.1.
TT-BRWP shows faster convergence to the stationary distribution.

Slide 18

Example II: Solving Bayesian Inverse Problems

Solving nonlinear elliptic inverse problems with noisy measurement data by sampling from a posterior distribution.^2
Figure: $L^2$ error in parameter estimation for TT-BRWP, BRWP, and MALA with different step sizes ($h = 10^{-3}$ and $h = 0.1$). (BRWP is the algorithm without tensor-train integration.)
TT-BRWP is more robust for larger step sizes.

^2 A. Stuart, Inverse problems: A Bayesian perspective, 2010.

Slide 19

Ongoing Work and Future Directions

- Convergence and mixing time for non-Gaussian log-concave distributions: utilizing score dynamics enables faster convergence compared to diffusion generated by Brownian motion.
- Optimal parameter selection: investigating the optimal choice of $h$ and $\beta$. The semi-backward Euler discretization permits larger and more effective step sizes.
- Applications: exploring applications in areas such as group-symmetric distributions, composite density functions, global optimization, and time-reversible diffusion.

Slide 20

A deep learning algorithm for mean field control problems via forward-backward score dynamics
Mo Zhou, Stan Osher, and Wuchen Li
August 2024
https://arxiv.org/pdf/2401.09547

Slide 21

Mean-field control problem

Minimize the cost functional
$$\inf_v \int_0^T \int_{\mathbb{R}^d} \big[ L(t, x, v(t,x))\, \rho(t,x) + F(t, x, \rho(t,x)) \big]\, dx\, dt + \int_{\mathbb{R}^d} V(x)\, \rho(T,x)\, dx$$
subject to the Fokker-Planck equation
$$\partial_t \rho(t,x) + \nabla_x \cdot (\rho(t,x) v(t,x)) = \beta^{-1} \Delta_x \rho(t,x), \quad \rho(0,x) = \rho_0(x),$$
or, equivalently, the stochastic dynamics
$$dX_t = v(t, X_t)\, dt + \sqrt{2/\beta}\, dW_t, \quad X_0 \sim \rho_0.$$

Slide 22

HJB equation characterizing optimality

Hamiltonian: the Legendre transform of the Lagrangian,
$$H(t, x, p) = \sup_{v \in \mathbb{R}^d} v^\top p - L(t, x, v).$$
HJB equation:
$$\partial_t \phi(t,x) + H(t, x, \nabla_x \phi(t,x)) + \beta^{-1} \Delta_x \phi(t,x) = \frac{\partial F}{\partial \rho}(t, x, \rho(t,x))$$
Optimal velocity:
$$v^*(t,x) = D_p H(t, x, \nabla_x \phi(t,x))$$
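The Legendre transform can be checked numerically by a direct grid search over $v$. A 1D sketch; the kinetic Lagrangian $L(v) = |v|^2/2$, for which $H(p) = |p|^2/2$ and $v^* = D_p H(p) = p$, is an illustrative choice:

```python
import numpy as np

def hamiltonian(p, lagrangian, v_grid):
    """Legendre transform H(p) = sup_v v*p - L(v), approximated by grid
    search (1D illustration; on the slide v ranges over R^d)."""
    return np.max(v_grid * p - lagrangian(v_grid))

v_grid = np.linspace(-10.0, 10.0, 100001)
kinetic = lambda v: 0.5 * v**2    # illustrative Lagrangian L(v) = |v|^2/2
for p in [-2.0, 0.0, 1.5]:
    # For this L the transform gives H(p) = p^2/2, attained at v* = p.
    print(hamiltonian(p, kinetic, v_grid), 0.5 * p**2)
```

The maximizing velocity on the grid is exactly the feedback $v^* = D_p H(p)$ that the HJB solution plugs back into the Fokker-Planck equation.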

Slide 23

Probability flow: stochastic $\to$ deterministic

Capture the stochastic dynamics in a deterministic way.
FP equation:
$$\partial_t \rho(t,x) + \nabla_x \cdot (\rho(t,x) v(t,x)) = \beta^{-1} \Delta_x \rho(t,x)$$
$$\iff \quad \partial_t \rho(t,x) + \nabla_x \cdot \Big[ \rho(t,x) \Big( v(t,x) - \beta^{-1} \nabla_x \log \rho(t,x) \Big) \Big] = 0$$
Stochastic dynamics: $dX_t = v(t, X_t)\, dt + \sqrt{2/\beta}\, dW_t$
Probability flow: $dx_t = \big( v(t, x_t) - \beta^{-1} \nabla_x \log \rho(t, x_t) \big)\, dt$
If $x_0$ follows the distribution $\rho_0$, then the density of $x_t$ precisely corresponds to the solution of the FP equation.

Slide 24

Forward-backward score system and deep learning implementation

Ours: forward-backward score ODE system
$$\partial_t x_t = D_p H(t, x_t, z_t) - \beta^{-1} \nabla_x \log \rho(t, x_t),$$
$$\partial_t y_t = L(t, x_t, D_p H(t, x_t, z_t)) + f_t - \beta^{-1} h_t - \beta^{-1} z_t^\top \nabla_x \log \rho(t, x_t),$$
$$x_0 \sim \rho_0, \quad y_T = V(x_T),$$
where $y_t = \phi(t, x_t)$, $z_t = \nabla_x \phi(t, x_t)$, $h_t = \Delta_x \phi(t, x_t)$, and $f_t = \frac{\partial F}{\partial \rho}(t, x_t, \rho(t, x_t))$. Discretization error: $O(\Delta t)$.

Traditional: forward-backward SDE system
$$dX_t = D_p H(t, X_t, Z_t)\, dt + \sqrt{2/\beta}\, dW_t,$$
$$dY_t = \big( L(t, X_t, D_p H(t, X_t, Z_t)) + f(t, X_t, \rho(t, X_t)) \big)\, dt + \sqrt{2/\beta}\, Z_t^\top dW_t,$$
$$X_0 \sim \rho_0, \quad Y_T = V(X_T).$$
The unique solution of the FBSDE above is $Y_t = \phi(t, X_t)$ and $Z_t = \nabla_x \phi(t, X_t)$. Discretization error: $O(\sqrt{\Delta t})$.

Slide 25

Numerical implementation

Construct a neural network for $\phi(t, x)$, denoted $\phi_N(t, x)$.
Loss function: match the dynamics $y_t$ and its neural network parametrization $\phi_N(t, x_t)$,
$$\mathbb{E} \int_0^T | \phi_N(t, x_t) - y_t |^2\, dt.$$

Slide 26

Numerical results

$$\inf_v \int_0^T \int_{\mathbb{R}^d} \Big( \frac{1}{2} |v(t,x)|^2 + \frac{1}{2} |x|^2 + \log \rho(t,x) \Big) \rho(t,x)\, dx\, dt + \int_{\mathbb{R}^d} V(x)\, \rho(T,x)\, dx$$
$$\text{s.t.} \quad \partial_t \rho(t,x) + \nabla_x \cdot \big( \rho(t,x) \nabla_x \phi(t,x) \big) = \Delta_x \rho(t,x)$$
Figure: left: training curve; middle: plot of the $\phi$ function; right: density plot.

Slide 27

Structured behavior and smaller error of the score dynamics

Figure: First row: score dynamics for particles. Second row: stochastic dynamics for particles. The score dynamics demonstrate more structured behavior.

errors       | 1d: $\phi$ | $\nabla_x \phi$ | $\nabla_x^2 \phi$ | 2d: $\phi$ | $\nabla_x \phi$ | $\nabla_x^2 \phi$
score method | 1.59%      | 0.52%           | 0.51%             | 1.82%      | 0.54%           | 0.53%
BSDE method  | 2.77%      | 2.63%           | 2.36%             | 1.92%      | 2.32%           | 2.14%

Future work: efficiently compute the score dynamics.

Slide 28

A primal-dual algorithm for reaction-diffusion equations
Shu Liu, Siting Liu, Stanley Osher, Xinzhe Zuo, Wuchen Li
August 2024
arXiv links: https://arxiv.org/abs/2305.03945, https://arxiv.org/abs/2401.14602

Slide 29

Introduction

- Goal: we propose an easy-to-implement iterative method for resolving the time-implicit schemes arising in solving reaction-diffusion (RD) type equations.
- Our treatment:
  - Formulate as a min-max saddle point problem;
  - Apply the primal-dual hybrid gradient (PDHG) method;
  - Apply suitable preconditioning matrices to accelerate the convergence of the algorithms.
- The method is applicable to various discrete numerical schemes with high flexibility.
- The method can efficiently produce numerical solutions with sufficient accuracy.
- Extension to adversarial training of physics-informed neural networks (PINNs); effectiveness is verified by numerical results.

Slide 30

Formulations

Reaction-diffusion (RD) type equation:
$$\partial_t u(x,t) = G\big( a L u(x,t) + b f(u(x,t)) \big), \quad u(\cdot, 0) = u_0,$$
with periodic or Neumann boundary conditions.
- $a, b \ge 0$; $f(u)$ is the nonlinear reaction term.
- Allen-Cahn: $G = \mathrm{Id}$, $L = \Delta$, $f(u) = u - u^3$;
- Cahn-Hilliard: $G = -\Delta$, $L = \Delta$, $f(u) = u - u^3$.
Suppose $L$, $G$ are the space discretizations of $\mathcal{L}$, $\mathcal{G}$. The time-implicit scheme
$$\frac{U_t - U_{t-1}}{h_t} = G\big( a L U_t + b f(U_t) \big), \quad 1 \le t \le N_t,$$
reduces to the time-accumulated nonlinear equation $F(U) = 0$, where
$$F(U) = D U - V - h_t\, \mathbf{G} \big( a \mathbf{L}\, U + b f(U) \big).$$
Here $U \in \mathbb{R}^{N_t \cdot N_x}$ and $F : \mathbb{R}^{N_t N_x} \to \mathbb{R}^{N_t N_x}$, where $N_x$ is the number of space discretization points; $D$ denotes the time-discretization matrix and $V \in \mathbb{R}^{N_t N_x}$ is a constant vector carrying the boundary and initial data; $\mathbf{G} = I_{N_t} \otimes G$, $\mathbf{L} = I_{N_t} \otimes L$.
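The stacked formulation can be sanity-checked numerically: with $D$ the block time-difference matrix, $V$ carrying the initial data, and the Kronecker lifts of $G$ and $L$, $F(U)$ reproduces the per-step residuals of the implicit scheme. A small sketch; the sizes and the Allen-Cahn-type choice $G = \mathrm{Id}$, $f(u) = u - u^3$ (the slide's reaction term up to a garbled sign) are illustrative:

```python
import numpy as np

# Tiny check that the stacked residual
#   F(U) = D U - V - h Gk (a Lk U + b f(U))
# matches the per-step residuals of the implicit scheme.  D is block
# bidiagonal in time and V carries the initial data (illustrative sizes).
Nx, Nt, h, a, b = 4, 3, 0.1, 1.0, 1.0
L = -2*np.eye(Nx) + np.roll(np.eye(Nx), 1, 0) + np.roll(np.eye(Nx), -1, 0)  # periodic
G = np.eye(Nx)                      # Allen-Cahn type: G = Id
f = lambda u: u - u**3

rng = np.random.default_rng(0)
u0 = rng.normal(size=Nx)            # initial data
U = rng.normal(size=(Nt, Nx))       # arbitrary candidate solution

Gk, Lk = np.kron(np.eye(Nt), G), np.kron(np.eye(Nt), L)
D = np.eye(Nt*Nx) - np.kron(np.eye(Nt, k=-1), np.eye(Nx))   # U_t - U_{t-1}
Vvec = np.zeros(Nt*Nx); Vvec[:Nx] = u0                      # first block sees u0

F = D @ U.ravel() - Vvec - h * Gk @ (a * Lk @ U.ravel() + b * f(U.ravel()))

# Per-step residuals of  (U_t - U_{t-1})/h = G (a L U_t + b f(U_t))
steps = [U[t] - (U[t-1] if t else u0) - h * G @ (a * L @ U[t] + b * f(U[t]))
         for t in range(Nt)]
print(np.max(np.abs(F - np.concatenate(steps))))   # agreement up to round-off
```

This makes concrete how the $N_t$ coupled implicit steps collapse into the single nonlinear system $F(U) = 0$ that the PDHG solver targets.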

Slide 31

Methods

Solving $F(U) = 0$ is equivalent to minimizing $\frac{1}{2\epsilon} \|F(U)\|^2$, which in turn is equivalent to solving the saddle point problem
$$\min_{U \in \mathbb{R}^N} \max_{P \in \mathbb{R}^N} \Big\{ P^\top F(U) - \frac{\epsilon}{2} \|P\|^2 \Big\}.$$
But $F(U)$ can be ill-conditioned, so preconditioning is needed.
- Consider a preconditioning matrix $M$, e.g. $M = D - h_t \mathbf{G}(a \mathbf{L} + b f'(\bar{u}) I)$, where $\bar{u}$ denotes the equilibrium state ($\pm 1$ for the Allen-Cahn and Cahn-Hilliard equations).
- Replace $F(\cdot)$ by $\hat{F}(\cdot) = M^{-1} F(\cdot)$. ($M^{-1}$ can be applied via the FFT or DCT.)
Apply the primal-dual hybrid gradient method to $\min_U \max_P P^\top \hat{F}(U) - \frac{\epsilon}{2} \|P\|^2$:
$$P^{k+1} = \frac{1}{1 + \epsilon \tau_P} \big( P^k + \tau_P M^{-1} F(U^k) \big);$$
$$\tilde{P}^{k+1} = P^{k+1} + \omega (P^{k+1} - P^k);$$
$$U^{k+1} = U^k - \tau_U \big( M^{-1} DF(U^k) \big)^\top \tilde{P}^{k+1},$$
where $DF$ is the Jacobian of $F$, $\tau_U$, $\tau_P$ are the step sizes of the PDHG algorithm, and $\omega$ is the extrapolation coefficient.
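The three-line iteration can be exercised on a single tiny implicit Allen-Cahn-type step. A hedged sketch; the grid, the parameters, the dense solve for $M^{-1}$ (the slide uses FFT/DCT), and the reaction term $f(u) = u - u^3$ are illustrative choices:

```python
import numpy as np

# One implicit step  F(U) = U - U_prev - h (a L U + b f(U)) = 0
# solved by the preconditioned PDHG iteration (illustrative 1D setup).
N, h, a, b = 16, 0.01, 1.0, 1.0
Lap = -2*np.eye(N) + np.roll(np.eye(N), 1, 0) + np.roll(np.eye(N), -1, 0)  # periodic
f  = lambda u: u - u**3
df = lambda u: 1.0 - 3.0*u**2

rng = np.random.default_rng(2)
U_prev = np.sign(rng.normal(size=N)) * 0.9       # state near the equilibria +-1
F = lambda U: U - U_prev - h*(a*(Lap @ U) + b*f(U))
M = np.eye(N) - h*(a*Lap + b*df(1.0)*np.eye(N))  # preconditioner: Jacobian at u = 1

tauU, tauP, eps, omega = 0.5, 0.5, 1.0, 1.0
U, P = U_prev.copy(), np.zeros(N)
for _ in range(400):
    Pn = (P + tauP * np.linalg.solve(M, F(U))) / (1.0 + eps * tauP)
    Pt = Pn + omega * (Pn - P)
    J = np.eye(N) - h*(a*Lap + b*np.diag(df(U)))  # Jacobian DF(U)
    U = U - tauU * np.linalg.solve(M, J).T @ Pt
    P = Pn
print(np.linalg.norm(F(U)))   # residual driven to ~0
```

At the fixed point $P = M^{-1}F(U)/\epsilon$ and $(M^{-1}DF)^\top P = 0$ force $F(U) = 0$, so the saddle-point iteration really is a nonlinear solver for the implicit step.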

Slide 32

Convergence of preconditioned PDHG

Suppose $f(\cdot)$ is Lipschitz and $L_h G_h = G_h L_h$. Denote $R(U) = f(U) - f(\bar{u}) - f'(\bar{u})(U - \bar{u})$. Assume $h_t$, $N_t$ satisfy:
- (Allen-Cahn type) $N_t \cdot h_t < \dfrac{\sqrt{2} - 1}{b \, \mathrm{Lip}(R)}$;
- (Cahn-Hilliard type) $h_t < \dfrac{4 (\sqrt{2} - 1)^2 a}{b^2 \big( \mathrm{Lip}(R) - (\sqrt{2} - 1) c \big)_+^2}$, $\quad N_t \le \Big\lfloor \dfrac{(\sqrt{2} - 1)\big( 2\sqrt{a/h_t} + b c \big)}{b \, \mathrm{Lip}(R)} \Big\rfloor$.
Then we have:
- There exists a unique solution $U^*$ of $F(U) = 0$.
- One can pick hyperparameters $\tau_P$, $\tau_U$, $\omega$, $\epsilon$ such that $U^k$ converges linearly to $U^*$:
  $$\|U^k - U^*\|_2^2 \le \text{Constant} \cdot \Big( \frac{2 + \delta}{2 + \sqrt{2\delta + 4}} \Big)^{k+1},$$
  where $\delta > 0$ is independent of the space discretization.
Figure: iterations needed for convergence vs. $N_x$ for the 6th-order, variable-coefficient, Cahn-Hilliard, and Allen-Cahn equations.

Slide 33

A 6th-order reaction-diffusion equation

We consider the 6th-order Cahn-Hilliard-type equation depicting pore formation in functionalized polymers [Gavish et al. 2012]:
$$\frac{\partial u(x,t)}{\partial t} = \Delta \big( \epsilon_0^2 \Delta - W''(u) + \epsilon_0^2 \big)\big( \epsilon_0^2 \Delta u - W'(u) \big) \ \text{ on } [0, 2\pi]^2, \quad u(\cdot, 0) = u_0.$$
Set $\epsilon_0 = 0.18$ with initial condition $u_0 = 2 e^{\sin x + \sin y - 2} + 2.2\, e^{-\sin x - \sin y - 2} - 1$. The equation is imposed with periodic boundary conditions.
When setting up the preconditioning matrix $M$, we replace the original $G = \Delta_h \big( \epsilon_0^2 \Delta_h - \mathrm{diag}(W''(U)) + \epsilon_0^2 I \big)$ by $\tilde{G} = \Delta_h \big( \epsilon_0^2 \Delta_h - W''(\bar{u}) I + \epsilon_0^2 I \big)$, where $W''(\bar{u}) = W''(\pm 1) = 2$.
Figure: numerical solution at the time stages $t = 0.0$, $0.1$, $2.0$, $20.0$.

Slide 34

Ongoing research: apply the preconditioned PDHG method to adversarial training for PDEs

- General partial differential equation (PDE): $\mathcal{L} u = f$ on $\Omega \subset \mathbb{R}^d$.
- $\inf_u \sup_\lambda L(u, \lambda) := \langle \mathcal{L} u - f, \lambda \rangle - \frac{\epsilon}{2} \|\lambda\|^2$. Parameterize $u$, $\lambda$ by neural networks $u_\theta$, $\lambda_\eta$.
- Natural Primal-Dual Gradient (NPDG): PDHG with suitable preconditioning.
- Comparison with classical methods on a 20-dimensional elliptic equation with variable coefficients.

Slide 35

A primal-dual hybrid gradient method for solving optimal control problems and the corresponding Hamilton-Jacobi PDEs
Tingwei Meng (UCLA), Siting Liu (UCLA), Wuchen Li (University of South Carolina), Stanley Osher (UCLA)
August 4, 2024
Meng, Liu, Li, Osher. PDHG control. August 4, 2024.

Slide 36

Motivation

Solving the optimal control problem
$$\min_{\alpha(\cdot)} \Big\{ \int_t^T L(\gamma(s), s, \alpha(s))\, ds + g(\gamma(T)) : \gamma(t) = x, \ \dot{\gamma}(s) = f(\gamma(s), s, \alpha(s)) \ \forall s \in (t, T) \Big\}, \quad (1)$$
and the corresponding Hamilton-Jacobi (HJ) PDE
$$\frac{\partial \varphi(x,t)}{\partial t} + \sup_{\alpha \in \mathbb{R}^m} \big\{ -\langle f(x, T-t, \alpha), \nabla_x \varphi(x,t) \rangle - L(x, T-t, \alpha) \big\} = 0, \quad x \in \Omega, \ t \in [0, T],$$
$$\varphi(x, 0) = g(x), \quad x \in \Omega. \quad (2)$$
Grid-based methods (e.g., the Lax-Friedrichs or Engquist-Osher schemes) need to satisfy the CFL condition in the discretization. In this project, we provide an optimization method for solving certain HJ PDEs with a saddle point formulation and implicit time discretization. This allows us to choose a larger time step size and avoid the CFL condition.
In the literature, there are several optimization algorithms for solving HJ PDEs [Darbon, Dower, Osher, Yegorov, ...]. Compared to these methods, the optimization algorithm proposed in this project can handle a more general Hamiltonian that depends on $(x, t)$. A partial theoretical guarantee is provided.

Slide 37

Continuous formulation

Solving the original problems (1) and (2) is equivalent to the following optimization problem:
$$\min_{\varphi \text{ satisfying } (2)} \ -c \int \varphi(x, T)\, dx, \quad (3)$$
where $c \in \mathbb{R}$ is a constant used to stabilize the method. Introducing the Lagrange multiplier $\rho$, we get
$$\min_{\substack{\varphi \\ \varphi(x,0) = g(x)}} \max_{\rho} \int_0^T \!\! \int \rho(x,t) \Big( \frac{\partial \varphi(x,t)}{\partial t} + \sup_{\alpha \in \mathbb{R}^m} \big\{ -\langle f_{x,t}(\alpha), \nabla_x \varphi(x,t) \rangle - L_{x,t}(\alpha) \big\} \Big) dx\, dt - c \int \varphi(x, T)\, dx. \quad (4)$$
Under certain conditions, this becomes the saddle point problem
$$\min_{\substack{\varphi \\ \varphi(x,0) = g(x)}} \max_{\rho \ge 0,\, \alpha} \int_0^T \!\! \int \rho(x,t) \Big( \frac{\partial \varphi(x,t)}{\partial t} - \langle f_{x,t}(\alpha(x,t)), \nabla_x \varphi(x,t) \rangle - L_{x,t}(\alpha(x,t)) \Big) dx\, dt - c \int \varphi(x, T)\, dx. \quad (5)$$
We discretize and then employ the PDHG method to solve this saddle point problem, using implicit time discretization to avoid the CFL condition.

Slide 38

Numerical results: a quadratic Hamiltonian

We solve the optimal control problem with dynamics $f(x, t, \alpha)_I = -(|x_I - 1|^2 + 0.1)\, \alpha_I$ for $I = 1, 2$ and Lagrangian $L(x, t, \alpha) = \frac{1}{2} |\alpha|^2$. Then the Hamiltonian is
$$H(x, t, p) = \frac{1}{2} \sum_{I=1}^2 \big( |x_I - 1|^2 + 0.1 \big)^2 p_I^2.$$
The terminal cost function is $g(x) = \sum_{I=1}^2 \sin(\pi x_I)$, over the spatial domain $[0, 2]^2$ under periodic boundary conditions.
Figure: (a) $(x, y) \mapsto \varphi(x, y, t)$, the solution of the HJ PDE, at different $t$; (b) the optimal trajectories $\gamma^*$.
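The closed-form Hamiltonian on this slide follows from the sup over $\alpha$ in the HJ PDE, and can be verified per coordinate by a grid search. A 1D sketch (the grid bounds are illustrative):

```python
import numpy as np

def H_numeric(x, p, a_grid):
    """H(x, p) = sup_a { -f(x, a) p - L(a) } for the slide's dynamics
    f(x, a) = -(|x - 1|^2 + 0.1) a and L(a) = a^2/2, by 1D grid search."""
    c = abs(x - 1.0)**2 + 0.1
    return np.max(c * a_grid * p - 0.5 * a_grid**2)

a_grid = np.linspace(-20.0, 20.0, 400001)
for x, p in [(0.0, 1.0), (1.5, -2.0)]:
    closed = 0.5 * (abs(x - 1.0)**2 + 0.1)**2 * p**2   # slide's closed form
    print(H_numeric(x, p, a_grid), closed)
```

The maximizer is $\alpha^* = (|x-1|^2 + 0.1)\, p$, which is the feedback control the bottom-right panels of the next slide visualize in the 1-homogeneous case.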

Slide 39

Numerical results: a 1-homogeneous Hamiltonian

We solve a 1D problem with the same $f$ and $g$ as on the last page. The Lagrangian is the indicator function of the unit ball in $\ell^\infty$. The Hamiltonian is $H(x, t, p) = (|x - 1|^2 + 0.1)\, |p|$ (1-homogeneous with respect to $p$).
Figure: top left: solution of the HJ PDE; top right: feedback optimal control; bottom left: optimal trajectories; bottom right: optimal controls.