Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lecture note in Oberwolfach Seminar 2023

Wuchen Li
November 21, 2023
170

Lecture note in Oberwolfach Seminar 2023

Variational and stochastic flows are now ubiquitous in machine learning and generative modeling. Indeed, many such models can be interpreted as flows from a latent distribution to the sample
distribution, and training corresponds to finding the right flow vector field. Optimal transport and diffeomorphic flows provide powerful frameworks to analyze such trajectories of distributions with elegant notions from differential geometry, such as geodesics, gradient, and Hamiltonian flows. Recently, mean-field control and mean-field games offer general optimal control variational problems on the learning problem. How do these tools lead us to a better understanding and further development of machine learning and generative models?

Wuchen Li

November 21, 2023
Tweet

More Decks by Wuchen Li

Transcript

  1. Generalized optimal transport: optimization and mean field control in probability

    spaces Wuchen Li University of South Carolina, Columbia Oberwolfach seminar Nov 22, 2023. 1
  2. Generalized optimal transport I Optimization in probability space. – Gradient

    flows; – Accelerated flows; – Newton’s flows; – Convexity analysis; I Optimal control in probability space: – Hamiltonian flows: – Mean field control; – Mean field games; I Applications: – Regularized Wasserstein proximal operators; – Mean field modeling of pandemics propagation; – Scientific computing: Deep JKO schemes; 2
  3. Topic I: Optimization in probability space I Accelerated information gradient

    flows. I Information Newton’s flow: second-order optimization method in probability space. I Interacting Langevin Di↵usions: Gradient Structure and Ensemble Kalman Sampler. 3
  4. Optimization I Optimization problems for sampling algorithms attract increasing attentions

    from AI and machine learning communities. I Examples are given in model (finite di↵erence, finite element, neural networks, kernels) and model free ones: Variational inference, Bayesian inference, Generative Adversary Networks (GAN), and policy optimizations, etc. 6
  5. Sampling Algorithm Arguably, one of the most classical sampling algorithm

    is designed by the over-damped Langevin dynamics: dXt = rf(Xt)dt + p 2dBt, where Xt 2 Rd and Bt is the standard Brownian motion in Rd. Denote Xt ⇠ ⇢t . The evolution of ⇢t satisfies the Fokker-Planck equation @t⇢t = r · (⇢t rf) + ⇢t. The invariant distribution satisfies ⇢ ⇤ (x) = 1 K e f(x) , K = Z e f(y) dy. 8
  6. Optimization in probability space Nowadays, the sampling algorithm relates to

    the variational problem in probability density space P(⌦) = n ⇢ 2 F(⌦): Z ⌦ ⇢dx = 1, ⇢ 0. o . Consider the optimization problem therein: min ⇢2P(⌦) E(⇢). A typical choice of objective functional in Bayesian inference problem is the KL divergence: E(⇢) = DKL(⇢ke f ) = Z ⇢ log ⇢ e f dx. 9
  7. Information Metrics Typical examples include Fisher or Wasserstein information metrics.

    Information geometry I Fish information statistics (Amari, Ay, Zhang et.al.); I AI Inference problems: f–divergence, Amari ↵–divergence etc; I AI optimization methods: Natural gradient (Amari); Stochastic relaxation (Pistone, Malago), et.al. Transport geometry I Gradient descent: (Jordan, Kinderlehrer, Otto, Villani, Liu’s group, Slepcev, Carillo, Cances, Wang, Craig et.al.); I Applied math: (Osher, Rongjie, Preye, Solomon, Oberman, Benamou, Enquist, Yin, Stuart, et.al.); I Engineering: (Chen, Pavon, Georgiou, Tannenbaum, Guo et.al.); I AI Inference problems: (Cuturi et.al., Frogner et.al. 2015, Bottou et.al. 2017, Yau, Gu et.al 2019); I Optimization (Winsolo). 10
  8. Gradient descent The Gradient flow of E(⇢) in a metric

    space satisfies @t⇢t = G(⇢t) 1 E(⇢t) ⇢t . where G is the metric (precondioner) operator, which is often assume to be positive definite. Example I Fisher-Rao gradient flow of KL divergence leads to @t⇢t = grad F DKL(⇢t k⇢ ⇤ ) = ⇢(log ⇢ + f Z ⇢(log ⇢ + f)dx). I Wasserstein gradient flow of KL divergence satisfies @t⇢t = grad W DKL(⇢t k⇢ ⇤ ) =r · (⇢t r(f + log ⇢t + 1)) =r · (⇢t rf) + ⇢t. 11
  9. First-order algorithms Sampling methods are first-order methods in probability space

    based on di↵erent choices of metrics. They are usually “mean field” dynamics. I Stein metric and Stein variational gradient descent1; I Wasserstein-Kalman metric and Ensemble Kalman sampling2; I Wasserstein Newton’s flow. 1Liu and Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. NIPS, 2016. 2Garbuno-Inigo, Ho↵mann, Li, and Stuart. Interacting Langevin di↵usions: Gradient structure and ensemble Kalman sampler. SIAM applied dynamical system, 2019. 12
  10. Related works I KL divergence+Wasserstein metric + Gradient descent =

    Langevin dynamics (Optimal transport, Jordan, Kinderlehrer, Otto, Villani et.al.) I KL divergence+ Fisher-Rao metric+Gradient descent=Birth-death dynamics (Information geometry, Amari, et. al.) I KL divergence +Stein metric + Gradient descent = Stein variational gradient descent (Liu et.al.) I KL divergence+ Wasserstein-Kalman metric + Gradient descent = Ensemble Kalman sampling (Inigo, Ho↵man, Li, Stuart et.al.) Divergence+Metric + Accelerated gradient flows= ? 13
  11. Goal: Accelerated Langevin dynamics Can we design systematic accelerated gradient

    flows for samplings, which applies the symplectic structures in information metrics? I Extending from the fact that optimal transport gradient flow of KL divergences forms Langevin dynamics, what are accelerated gradient flows? I What is the “Mean-field” accelerated Langevin dynamics? 14
  12. Related works Finite dimensional I Euclidean space (Su. et.al); I

    Variational formulation (Wibisono, Wilson and Jordan); I Hamiltonian descent (C. Maddison, D. Paulin, Y. Teh, B. O’Donoghue, A. Doucet); Density space I Liu et al. (ICML 2019): acceleration framework of ParVI methods based on manifold optimization; I Taghvaei and Mehta (ICML 2019): the accelerated flow from an optimal control perspective; I Carrilo et.al. (CMP 2019) study the compressible Euler equation; I Cheng et al. and Ma et al.: underdamped Langevin dynamics. 15
  13. Gradient flows I The methods for optimization problems in the

    Euclidean space have been well-studied. min x2Rn f(x). I The classical gradient descent method follows xk+1 = xk srf(xk). It can be formulated into a continuous time gradient flow, ˙ x = rf(x). 16
  14. Accelerate Gradient flows I The Nesterov’s accelerated method is widely

    applied in accelerating the vanilla gradient descent. It corresponds to the so-called accelerated gradient flow ¨ x + ↵t ˙ x + rf(x) = 0. I This is equivalent to a damped Hamiltonian system  ˙ x ˙ p +  0 ↵tp  0 I I 0  rxHE (x, p) rpHE (x, p) = 0, H E (x, p) = 1 2 kpk2 2+f(x). with initial values x(0) = x0 and p(0) = 0. 1Su et. al. A di↵erential equation for modeling Nesterov accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 2016. 17
  15. A quick review on Information Metrics Definition (Metric in the

    probability space) A metric tensor G(⇢) : T⇢ P(⌦) ! T ⇤ ⇢ P(⌦) is an invertible mapping from the tangent space at ⇢ to the cotangent space at ⇢. This metric tensor defines the metric (inner product) on the tangent space T⇢ P(⌦). Namely, for 1, 2 2 T⇢ P(⌦), we define g⇢( 1, 2) = Z 1G(⇢) 2dx = Z 1G(⇢) 1 2dx, where i is the solution to i = G(⇢) 1 i , i = 1, 2. 18
  16. Example: Fisher-Rao metric I The Fisher-Rao metric comes from Information

    Geometry. I The inverse of the Fisher-Rao metric tensor is defined by G F (⇢) 1 = ⇢ ✓ Z ⇢dx ◆ , 2 T ⇤ ⇢ P(⌦). I For 1, 2 2 T⇢ P(⌦), the Fisher-Rao metric on the tangent space follows g F ⇢ ( 1, 2) = Z 1 2⇢dx ✓Z 1⇢dx ◆ ✓Z 2⇢dx ◆ , where i is the solution to i = ⇢ i R i⇢dx , i = 1, 2. 19
  17. Example: Wasserstein metric I The Wasserstein metric comes from Optimal

    Transport. I The inverse of the Wasserstein metric tensor is defined by G W (⇢) 1 = r · (⇢r ), 2 T ⇤ ⇢ P(⌦). I The Wasserstein metric on the tangent space is given by g W ⇢ ( 1, 2) = Z ⇢ hr 1, r 2 i dx, 1, 2 2 T⇢ P(⌦), where i is the solution to i = r · (⇢r i), i = 1, 2. 20
  18. Gradient flows I The gradient flow for E(⇢) in (P(⌦),

    g⇢) takes the form @t⇢t = G(⇢t) 1 E(⇢t) ⇢t . I The Fisher-Rao gradient flow is given by @t⇢t = ⇢t ✓ E ⇢t Z E ⇢t ⇢tdx ◆ . I The Wasserstein gradient flow writes @t⇢t = r · ✓ ⇢t r E ⇢t ◆ . 21
  19. Hamiltonian flows I The Euler-Lagrange equation in the probability space

    can be formulated as a system of (⇢t, t), i.e., 8 > < > : @t⇢t G(⇢t) 1 t = 0, @t t + 1 2 ⇢t ✓Z tG(⇢t) 1 tdx ◆ + E ⇢t = 0, where t is up to a spatially-constant function shrift. I Here the above PDE system is the Hamiltonian flow @t  ⇢t t  0 1 1 0 " ⇢t H(⇢t, t) t H(⇢t, t) # = 0, with respect to the Hamiltonian H(⇢t, t) = 1 2 Z tG(⇢t) 1 tdx + E(⇢t). 22
  20. Example: Fisher-Rao Hamiltonian flow I The Fisher-Rao Hamiltonian flow follows

    8 > > < > > : @t⇢t t⇢t + Z t⇢tdx = 0, @t t + 1 2 2 t ✓Z ⇢t tdx ◆ t + E ⇢t = 0. I The corresponding Hamiltonian is HF (⇢t, t) = 1 2 Z 2 t ⇢tdx ✓Z ⇢t tdx ◆2 ! + E(⇢t). 23
  21. Example: Wasserstein Hamiltonian flow I The Wasserstein Hamiltonian flow writes

    8 < : @t⇢t + r · (⇢t r t) = 0, @t t + 1 2 kr t k2 + E ⇢t = 0. I The corresponding Hamiltonian forms HW (⇢t, t) = 1 2 Z kr t k2 ⇢tdx + E(⇢t). I This is identical to the dynamics in optimal transport and Mean field games. 24
  22. Accelerate Information Gradient flow I Let ↵t 0 be a

    scalar function of t. We add a damping term ↵t t to the Hamiltonian flow. @t  ⇢t t +  0 ↵t t  0 1 1 0 " ⇢t H(⇢t, t) t H(⇢t, t) # = 0. I This renders the Accelerated Information Gradient (AIG) flow 8 > < > : @t⇢t G(⇢t) 1 t = 0, @t t + ↵t t + 1 2 ⇢t ✓Z tG(⇢t) 1 tdx ◆ + E ⇢t = 0, (AIG) with initial values ⇢0 = ⇢0 and 0 = 0. 25
  23. The choice of ↵t The choice of ↵t depends on

    the geodesic convexity of E(⇢), which has an equivalent definition as follows. Definition For a functional E(⇢) defined on the probability space, we say that E(⇢) has a -positive Hessian (in short, Hess( )) w.r.t. the metric g⇢ if there exists a constant 0 such that for any ⇢ 2 P(⌦) and any 2 T⇢ P(⌦), we have g⇢(Hess E(⇢) , ) g⇢( , ). Here Hess is the Hessian operator w.r.t. g⇢ . I If E(⇢) is Hess( ) for > 0, then ↵t = 2 p . I If E(⇢) is Hess(0), then ↵t = 3/t. 26
  24. Fisher-Rao AIG flow I The Fisher-Rao AIG flow writes 8

    > > < > > : @t⇢t t⇢t + ✓Z t⇢tdx ◆ ⇢t = 0, @t t + ↵t t + 1 2 2 t ✓Z ⇢t tdx ◆ t + E ⇢t = 0. (F-AIG) 27
  25. Wasserstein AIG flow I The Wasserstein AIG flow in the

    Eulerian formulation follows 8 < : @t⇢t + r · (⇢t r t) = 0, @t t + ↵t t + 1 2 kr t k2 + E ⇢t = 0. (W-AIG) I The Wasserstein AIG flow in the Lagrangian formulation leads to 8 > < > : dXt = Vtdt, dVt = ↵tVtdt r ✓ E ⇢t ◆ (Xt)dt. I This is a particular type of initial-valued damped mean-field game systems, known in fluid dynamics (Carillo, 2020, Amir, et.al. 2019.). I Many other generalizations of AIG flows are provided in (Wang, Li, 2019). 28
  26. Examples in Gaussian families Theorem Suppose that ⇢0, ⇢⇤ 2

    N n 0 and their covariance matrices are ⌃0 and ⌃⇤ . E(⌃) evaluates the KL divergence from ⇢ to ⇢⇤ . Here (⌃t, St) satisfies ( ˙ ⌃t (St⌃t + ⌃tSt) = 0, ˙ St + ↵tSt + S2 t + 2r⌃t E(⌃t) = 0, (W-AIG-G) with initial values ⌃0 = ⌃0 and S0 = 0. Especially, for any t 0, ⌃t is well-defined and stays positive definite. 29
  27. Convergence rate of W-AIG flows Theorem Suppose that E(⇢) satisfies

    Hess( ) for > 0. The solution ⇢t to (W-AIG) with ↵t = 2 p satisfies E(⇢t)  C0e p t = O(e p t ). If E(⇢) only satisfies Hess(0), then the solution ⇢t to (W-AIG) with ↵t = 3/t satisfies E(⇢t)  C 0 0t 2 = O(t 2 ). Here the constants C0, C 0 0 only depend on ⇢0. 30
  28. PDE convexities Convergence rate I Hess( ) is equivalent to

    the -geodesic convexity in the probability space w.r.t. g⇢ . For the Wasserstein metric, it is also known as -displacement convexity. I If E(⇢) is the KL divergence and the target density takes the form ⇢ ⇤ / exp( f(x)). – A su cient condition for Hess( ) is that f(x) is -strongly convex. – The W-AIG flow improves the O(e 2 t) convergence rate of the gradient flow to O(e p t), especially when is close to 0. Convexities and entropy dissipation I -geodesic convexity=Wasserstein Hessian operator (Otto, Villani et.al.)=Weak form of Gamma calculus (Bakry-Emery et.al.) I Fisher-Rao + density manifold Christo↵el symbol=Weak form of Gamma calculus (Li, 2018). 31
  29. Sketch of the proof I In the proof of previous

    two propositions, we consider ut = @t(Tt) 1 Tt with properties r · (⇢t(ut r t)) = 0, @tTt = rTtut. I We show the following relationship Z hr t, @tTt + rTt r t i ⇢tdx = Z hr t ut, rTt(r t ut)i ⇢tdx 0. This is based on the Hodge decomposition. 32
  30. W-AIG flows in the particle level Proposition Suppose that Xt

    ⇠ ⇢t and Vt = r t(Xt) are the position and the velocity of a particle at time t. Then, the di↵erential equation of the particle system corresponding to (W-AIG) writes ( dXt = Vtdt, dVt = ↵tVtdt rf(Xt)dt r log ⇢t(Xt)dt. 33
  31. Discrete-time algorithm for W-AIG flow For i = 1, 2

    . . . N, the update rule follows ( V i k+1 = ↵kV i k p ⌧(rf(X i k) + ⇠k(X i k)), X i k+1 = X i k + p ⌧V i k+1. If E(⇢) is Hess( ), then ↵k = 1 p ⌧ 1+ p ⌧ ; if E(⇢) is Hess(0) or is unknown, then ↵k = k 1 k+2 . Here ⇠k(x) is an approximation of r log ⇢t(x). Di culties I The approximation of r log ⇢t(x). I The sti↵ness of AIG flows. 34
  32. Reviews on the choices of ⇠k I If Xi k

    follows a Gaussian distribution, then ⇠k(x) = ⌃ 1 k (x mk), where mk and ⌃k are the mean and the covariance of {Xi k }N i=1 . I For the non-Gaussian case, we use the kernel density estimation (KDE1). ⇠k writes ⇠k(x) = r log ˜ ⇢k(x) = PN i=1 rxK(x, Xi k) PN i=1 K(x, Xi k) . A common choice of K(x, y) is a Gaussian kernel with the bandwidth h, KG (x, y; h) = (2⇡h) n/2 exp kx yk2 /(2h) . 1Radhey S Singh. Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. The Annals of Statistics, pp. 394399, 1977. 35
  33. Selection of the bandwidth I SVGD1 uses a median (MED)

    method to choose the bandwidth, i.e., hk = 1 2 log(N + 1) median ⇣ {kX i k X j k k2}N i,j=1 ⌘ . I Liu et al.2 propose a Heat Equation (HE) method to adaptively adjust bandwidth. I Motivated by the HE method, we introduce the Brownian motion (BM) method to adaptively learn the bandwidth. 1Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pp. 23782386, 2016. 2Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Accelerated first-order methods on the Wasserstein space for bayesian inference. arXiv preprint arXiv:1807.01750, 2018. 36
  34. The BM method I Given the bandwidth h, {Xi k

    }N i=1 and a step size s, we can compute two particle systems Y i k (h) = X i k s⇠k(x; h), Z i k = X i k + p 2sB i , i = 1, . . . N, where Bi is the standard Brownian motion. I Minimize MMD(ˆ ⇢Y , ˆ ⇢Z), the maximum mean discrepancy between the empirical distribution ˆ ⇢Y (x) = PN i=1 (x Y i k (h)) and ˆ ⇢Z(x) = PN i=1 (x Zi k). I Optimize over h, using the bandwidth hk 1 from the last iteration as the initialization. 37
  35. Adaptive restart I The second equation in (W-AIG) is the

    Hamilton-Jacobi equation, which usually has the strong sti↵ness. We propose an adaptive restart technique to deal with this problem. Consider 'k = N X i=1 ⌦ V i k+1, rf(X i k) + ⇠k(X i k) ↵ , which can be viewed as discrete-time approximation of @tE(⇢t). I If 'k < 0, then we restart the algorithm with initial values Xi 0 = Xi k and V i 0 = 0. I This essentially keeps @tE(⇢t) negative along the trajectory. I The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm. 38
  36. Overall algorithm Algorithm 1 Particle implementation of Wasserstein AIG flow

    Require: initial positions {Xi 0} N i=1 , step size ⌧, number of iteration L. 1: Set k = 0, V i 0 = 0, i = 1, . . . N. Set the bandwidth h0 by MED. 2: for l = 1, 2, . . . L do 3: Compute hl based on BM: hl = BM(hl 1 , {Xi k} N i=1 , p⌧). 4: Calculate ⇠k (Xi k ) with the bandwidth hl . 5: Set ↵k based on whether E(⇢) is Hess( ) or Hess(0). Update V i k+1 = ↵k V i k p⌧(rf(Xi k ) + ⇠k (Xi k )), Xi k+1 = Xi k + p⌧V i k+1 . 6: if RESTART then 7: Compute 'k = P N i=1 ⌦ V i k+1 , rf(Xi k ) + ⇠k (Xi k ) ↵ . 8: If 'k < 0, set Xi 0 = Xi k and V i 0 = 0 and k = 0; otherwise set k = k+1. 9: else 10: Set k = k + 1. 11: end if 12: end for 39
  37. Toy example I The target density ⇢ ⇤ is a

    toy bimodal distribution1. I We compare two types of particle implementations of the Wasserstein gradient flow over the KL divergence: X i k+1 = X i k ⌧rf(X i k)+ p 2⌧B i k. X i k+1 = X i k ⌧rf(X i k) ⌧⇠k(X i k). Here Bi k ⇠ N(0, 1) is the standard Brownian motion and ⇠k is computed by KDE. The first method is known as the MCMC method. The second method is called the ParVI method, whose bandwidth h is selected by MED/HE/BM respectively. 1Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015. 40
  38. Gaussian examples: ODE level I The initial value is set

    to be ⌃0 = I. I The following figure presents the convergence of the KL divergence on two target distributions with small/large L. Figure: The acceleration of W-AIG flow and the e↵ect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on R100. Left: L = 1,  ⇡ 3.8 ⇥ 103. Right: = 1,  ⇡ 4.0 ⇥ 103. 42
  39. Gaussian examples: Particle level I The initial distribution of samples

    follows N(0, I) and the number of samples is N = 600. I For a particle system {Xi k }N i=1 , we record the KL divergence E(ˆ ⌃k) using the empirical covariance matrix ˆ ⌃k . Figure: The acceleration of W-AIG flow and the e↵ect of adaptive restart (particle level). The setting of the target density is identical to the previous figure. 43
  40. Bayesian logistic regression Figure: Comparison of di↵erent methods on Bayesian

    logistic regression, averaged over 10 independent trials. The shaded areas show the variance over 10 trials. Top: BM Bottom: MED. Left: Test accuracy; Right: Test log-likelihood. 44
  41. Topic II: Optimal control in probability space I Computational mean-field

    information dynamics associated with reaction-di↵usion equations. I Controlling conservation laws I, II: entropy-entropy flux. I Mean field games. I Master equations for finite state mean field games with nonlinear activations. 45
  42. Reaction di↵usion Consider a scalar nonlinear reaction di↵usion equation by

    @tu(t, x) = F(u(t, x)) + R(u(t, x)), where x 2 ⌦, ⌦ ⇢ Rd is a compact convex spatial domain, u 2 M(⌦) = {u 2 C1(⌦): u 0}, and is the Euclidean Laplacian operator. 5
  43. Reaction di↵usions and generalized optimal transport In this talk, we

    first review reaction di↵usions with optimal transport type metric space. Using this connection, we further design mean field control problems for reaction-di↵usion. This also provides variational schemes for solving classical reaction-di↵usion in both time and spatial domains. 6
  44. Literature The optimal transport problem was first introduced by Monge

    in 1781 and relaxed by Kantorovich (Nobel prize) in 1940. It defines a distance in the space of probability distributions, named optimal transport, Wasserstein distance, or Earth Mover’s distance. I Mapping/Monge-Amp´ ere equation: Gangbo, Brenier, et.al; I Lyapunov methods and Gradient flows: Otto, Villani, Ambrosio, Gigli, Savare, Carillo, Mielke, et.al; I Hamiltonian flows: Compressible Euler equations, Potential mean field games, Schrodinger bridge problems, Schrodinger equations: Benamou, Brenier, Larsy, Lions, Carmona, Souganidis, Georgiou, Nelson, La↵erty, et.al. I Numerical OT and MFG: Benamou, Nurbekyan, Oberman, Osher, Achdou, Caines, Tamer, Lai, Lauriere, et.al. 8
  45. Lyapunov methods Consider a Lyapunov functional G(u) = Z G(u(x))dx,

    where G: R ! R is a given convex function with G00(u) > 0. In fact, how to choose entropy G can be given from entropy-entropy flux-metric condition. We have seen it in Prof. Osher’s talk. 10
  46. Metrics Along the solution of reaction di↵usion, one observes that

    d dtG(u(t, ·)) = Z G0(u) · @tudx = Z G0( F + R)dx = Z ⇣ rG0, rF ⌘ dx + Z G0Rdx = Z ⇣ rG0, ru ⌘ F0dx + Z G0Rdx = Z ⇣ rG0, ru ⌘ G00 F0(u) G00 dx + Z G02 R G0 dx = Z ⇣ rG0, rG0 ⌘ F0 G00 dx + Z G0(u)2 R G0 dx, where we apply integration by parts in the third equality and rG0(u) = G00(u)ru in the last equality. 11
  47. Lyapunov methods induced metrics I Denote the smooth positive probability

    space by M = n u 2 C1(⌦): u > 0 o . Given F, G: R ! R satisfying F0(u) > 0 if u > 0, and G00(u) > 0. Denote V1(u) = F0(u) G00(u) , V2(u) = R(u) G0(u) . Denote the tangent space of M at u 2 M by TuM = n 2 C1(⌦) o . 12
  48. Lyapunov methods induced metrics II Definition (Mean-field information metric) The

    inner product g(u): TuM ⇥ TuM ! R is given below. For any 1 , 2 2 TuM, define g(u)( 1, 2) = Z ⌦ (r 1, r 2)V1(u)dx + Z ⌦ 1 2V2(u)dx, where i = r · (V1(u)r i) + V2(u) i, i = 1, 2. 13
  49. Reaction-di↵usions in metric spaces Given an energy functional F :

    M ! R, the gradient flow of F in (M(⌦), g) satisfies @tu(t, x) = r · (V1(u)r uF(u))(t, x) V2(u) uF(u)(t, x). In particular, if F(u) = G(u) = R G(u)dx, then @tu =r · (V1(u)rG0(u)) V2(u)G0(u) =r · ( F0(u) G00(u) G00(u)ru) + R(u) G0(u) G0(u) =r · (F0(u)ru) + R(u) = F(u) + R(u). 14
  50. Controlling reaction di↵usions Denote a given energy functional by F

    : M(⌦) ! R. Consider a variational problem by inf v1,v2,u Z 1 0 h Z ⌦ 1 2kv1k 2V1(u) + 1 2|v2| 2V2(u)dx F(u) i dt where the infimum is taken among all density functions u: [0, 1] ⇥ ⌦ ! R, vector fields v1 : [0, 1] ⇥ ⌦ ! Rd, and reaction rate functions v2 : [0, 1] ⇥ ⌦ ! R, such that @tu(t, x) + r · (V1(u(t, x))v1(t, x)) = v2(t, x)V2(u(t, x)), with fixed initial and terminal density functions u0 , u1 2 M(⌦). 16
  51. Mean field information dynamics The critical point of variational problem

    satisfies v1(t, x) = r (t, x), v2(t, x) = (t, x), with 8 > > > > < > > > > : @tu(t, x) + r · (V1(u(t, x))r (t, x)) = V2(u(t, x)) (t, x) @t (t, x) + 1 2kr (t, x)k 2V 0 1 (u(t, x)) + 1 2| (t, x)| 2V 0 2 (u(t, x)) + uF(u)(t, x) = 0. 17
  52. Mean field information Hamiltonian There exists a Hamiltonian formulations for

    the control problem. 8 > < > : @tu = H(u, ), @t = uH(u, ), where H is the Hamiltonian functional defined by H(u, ) = 1 2 Z ⌦ kr k 2V1(u)dx + 1 2 Z ⌦ | | 2V2(u)dx + F(u). 18
  53. Example I Example (Wasserstein metric, heat flow and density Hamilton

    flows) Let G(u) = u log u u, F(u) = u, R(u) = 0, thus V1(u) = F0(u) G00(u) = u, V2(u) = R(u) G0(u) = 0. Then the mean-field information metric forms g(u)( 1, 2) = Z ⌦ (r 1(x), r 2(x))u(x)dx. Here the Hamiltonian flow in (M(⌦), g) satisfies 8 < : @tu + r · (ur ) = 0 @t + 1 2kr k 2 + uF(u) = 0. 19
  54. Example II Example (Fisher-Rao metric, Birth-death equation and density Hamiltonian

    flows) Consider V1(u) = F0(u) G00(u) = 0, V2(u) = R(u) G0(u) = u. Then the mean-field information metric satisfies g(u)( 1, 2) = Z ⌦ 1(x) 2(x)u(x)dx. Here the Hamiltonian flow in (M(⌦), g) satisfies 8 < : @tu u = 0 @t + 1 2| | 2 + uF(u) = 0. 20
  55. Example III Example (Fisher-KPP metric, Fisher-KPP equation and density Hamiltonian

    flows) Consider F(u) = u, G(u) = u log u u, R(u) = u(1 u). Thus V1(u) = F0(u) G00(u) = u, V2(u) = R(u) G0(u) = u(u 1) log u . Here the Hamiltonian flow in (M(⌦), g) satisfies 8 > > < > > : @tu + r · (ur ) u(u 1) log u = 0 @t + 1 2kr k 2 + | | 2 (2u 1) log u + 1 u (log u)2 + uF(u) = 0. 21
  56. Primal-dual hybrid gradient algorithms Consider the following convex optimization problem.

    min z f(Az) + g(z), where z is a variable to be minimized, f and g are convex functions and A is a continuous linear operator. The algorithm solves the problem by iterating z (k+1) = argmin z L(z, ¯ p) + 1 2⌧ kz z (k)k2 L 2 p (k+1) = argmax p L(z (k+1) , p) 1 2 kp p (k)k2 L 2 ¯ p = 2p (k+1) p (k) , (1) where p is a dual variable, L is a Lagrangian functional L(z, p) = g(z) + hAz, pi f ⇤(p) and f ⇤(p) = sup z hz, pi f(z) is the Legendre transform of f. 22
  57. Primal-dual hybrid gradient methods The scheme converges if the step

    sizes ⌧ and satisfy ⌧ kAT Ak L 2 < 1, where k · k L 2 is an operator norm in L 2 . These step sizes are dependent on the operator A. In this talk, we use G-prox PDHG, a variation of the primal-dual algorithm. G-Prox PDHG provides an appropriate choice of norms for the algorithm that allows the algorithm to converge faster than the vanilla PDHG algorithm. 23
  58. G-prox primal-dual hybrid gradient methods G-prox PDHG finds the saddle

    point by iterating z (k+1) = argmin z L(z, ¯ p) + 1 2⌧ kz z (k)k2 L 2 p (k+1) = argmax p L(z (k+1) , p) 1 2 kp p (k)k2 H ¯ p = 2p (k+1) p (k) , where the norm k · k H is defined as kzk2 H = kAT zk2 L 2 . By choosing this proper norm, the step size only needs to satisfy ⌧ < 1, which is independent of an operator A. 24
  59. Implementation of the algorithm We set the linear operator A

    as A(m1, m2, u) = @tu(t, x) + r · m1(t, x) m2(t, x). We set convex functionals f and g below g(m1, m2, u) = Z 1 0 Z ⌦ km1(t, x)k2 2V1(u(t, x)) + |m2(t, x)|2 2V2(u(t, x))dx F(u) dt f (A(m1, m2, u)) = ( 0 if A(m1, m2, u) = 0 1 otherwise f ⇤( ) = 0. Thus, we have the Lagrangian functional L(m1, m2, u, ) = Z 1 0 Z ⌦ n km1(t, x)k2 2V1(u(t, x)) + |m2(t, x)|2 2V2(u(t, x)) + (t, x) ⇣ @tu(t, x) + r · m1(t, x) m2(t, x) ⌘o dx F(u) dt. 25
  60. Implementation of the algorithm We find the saddle point (m

    ⇤ 1, m ⇤ 2, u ⇤ , ⇤) by iterating the following schemes: u (k+1) = argmin u L(m (k) 1 , m (k) 2 , u, ¯) + 1 2⌧ ku u (k)k2 L 2 m (k+1) 1 = argmin m L(m, m (k) 2 , u (k) , ¯) + 1 2⌧ km m (k) 1 k2 L 2 m (k+1) 2 = argmin m L(m (k) 1 , m, u (k) , ¯) + 1 2⌧ km m (k) 2 k2 L 2 (k+1) = argmax L(m (k+1) 1 , m (k+1) 2 , u (k+1) , ) 1 2 k (k)k2 H ¯ = 2 (k+1) (k) . Here, L 2 and H norms are defined as kuk2 L 2 = (u, u) L 2 = Z 1 0 Z ⌦ kuk2 dx dt, kuk2 H = kAT uk2 L 2 for any u : [0, 1] ⇥ ⌦ ! Rn. 26
  61. Implementation of the algorithm We can find the explicit formula

    of u (k+1) , m (k+1) 1 , m (k+1) 2 , (k+1) using the first-order optimality conditions. The optimality conditions of m (k+1) 1 , m (k+1) 2 , (k+1) result in first-order equations which can be solved easily. Thus, we have m (k+1) 1 = V1(u (k)) ⌧ + V1(u(k)) ⇣ m (k) 1 + ⌧r (k) ⌘ m (k+1) 2 = V2(u (k)) ⌧ + V2(u(k)) ⇣ m (k) 2 + ⌧ (k) ⌘ (k+ 1 2 ) = (k) + (AAT ) 1 ⇣ @tu (k+1) + r · m (k+1) 1 m (k+1) 2 ⌘ 27
  62. Implementation of the algorithm The optimality condition of u (k+1)

    involves the nonlinear equation, which we compute using Newton’s method. For j = 1, 2, · · · a = km1k2 2V 2 1 (u j ) V 0 1 (uj) |m2|2 2V 2 2 (u j ) V 0 2 (uj) @ t + c log uj + 1 ⌧ (uj u(k)) b = km1k2 V 3 1 (u j ) (V 0 1 (uj))2 km1k2 2V 2 1 (u j ) V 00 1 (uj) + |m2|2 V 3 2 (u j ) (V 0 2 (uj))2 |m2|2 2V 2 2 (u j ) V 00 2 (uj) + c u j + 1 ⌧ uj+1 = max(0, uj a b ) 28
  63. Algorithm Algorithm: G-prox PDHG for mean-field information variational problem Input:

    Initial density u0 and terminal density u1. Output: u, , m2 : [0, T] ⇥ ⌦ ! R, m1 : [0, T] ⇥ ⌦ ! Rd. For k = 1, 2, · · · u(k+1) Compute u(k+1) using Newton’s method. m(k+1) 1 = V1(u (k)) ⌧+V1(u (k)) ⇣ m(k) 1 + ⌧r (k) ⌘ m(k+1) 2 = V2(u (k)) ⌧+V2(u (k)) ⇣ m(k) 2 + ⌧ (k) ⌘ (k+ 1 2 ) = (k) + (AAT ) 1 ⇣ @ t u(k+1) + r · m(k+1) 1 m(k+1) 2 ⌘ (k+1) = 2 (k+ 1 2 ) (k) 29
  64. Example We present a numerical experiment using the Algorithm with

    di↵erent V1 , V2 functions. We wrote C++ codes to run the numerical experiments. For all the experiments, we used the discretization sizes Nx1 = Nx2 = 128, Nt = 30. V1(u) = u, V2(u) = u↵, ↵ > 0. We show how varying the exponent ↵ a↵ects the evolution of the densities. Below are the initial and terminal densities used in the simulations. u0(0, x) = 10 exp 60 (x 0.3)2 + (y 0.7)2 + 1, u1(0, x) = 20 exp 60 (x 0.7)2 + (y 0.3)2 + 1. 30
  65. Figure: Example: Snapshots of the densities at di↵erent times t.

    The first, the second, the third and the fourth rows represent V2(u) = u, V2(u) = u 2 , V2(u) = u 3 , V2(u) = u 4 , respectively. 31
  66. Figure: Example: Cross sections along a line x + y

    = 1 for di↵erent time t. Each color represents V2(u) = u, V2(u) = u 2 , V2(u) = u 4 . 32
  67. Topic II: Optimal control in probability space I Computational mean-field

    information dynamics associated with reaction-di↵usion equations. I Controlling conservation laws I, II: entropy-entropy flux. I Mean field games. I Master equations for finite state mean field games with nonlinear activations. 45
  68. Regularized scalar conservation laws Consider @ t u + r

    · f(u) = r · (A(u)ru), where the known variable is u = u(t, x), (t, x) 2 R + ⇥ Rd, f : Rd ! Rd is a flux function, A(u) 2 Rd⇥d is a positive semi-definite matrix function, and > 0 is a viscosity constant. The PDE is known as a regularized scalar conservation law. 4
  69. Conservation laws and generalized optimal transport In literature, there are

    actively joint studies to work on entropy, Fisher information, and transportation. Nowadays, these connections have applications in mean field games, information theory, AI, quantum computing, computational graphics, computational physics, and Bayesian inverse problems. In this talk, we connect Lax’s entropy-entropy flux in conservation laws with optimal transport type metric spaces. Following this connection, we further design variational discretizations for conservation laws and mean field control of conservation laws. This includes a new and e cient method for developing unconditionally stable in time implicit approximations to regularized conservation laws and many other initial value problems. 5
  70. Heat equation and Entropy dissipation Consider the heat equation in

    Rd by @u(t, x) @t = r · (ru(t, x)) = u(t, x). Consider the negative Boltzmann-Shannon entropy given by H(u) = Z Rd u(x)(log u(x) 1)dx. Along the time evolution of heat equation, the following dissipation relation holds: d dtH(u(t, ·)) = Z Rd krx log u(t, x)k 2u(t, x)dx = I(u(t, ·)), where I(u) is the Fisher information functional. There is a formulation behind this relation, namely I The heat flow is a gradient descent flow of entropy in optimal transport metric. 7
  71. Optimal transport What is the optimal way to move or

    transport the mountain with shape X, density u0(x) to another shape Y with density u1(y)? 8
  72. Literature The optimal transport problem was first introduced by Monge

    in 1781 and relaxed by Kantorovich (Nobel prize) in 1940. It defines a distance in the space of probability distributions, named optimal transport, Wasserstein distance, or Earth Mover’s distance. I Mapping/Monge-Amp´ ere equation: Gangbo, Brenier, et.al; I Gradient flows: Otto, Villani, Ambrosio, Gigli, Savare, Carillo, Mielke, et.al; I Hamiltonian flows: Compressible Euler equations, Potential mean field games, Schrodinger bridge problems, Schrodinger equations: Benamou, Brenier, Larsy, Lions, Georgiou, Tallenbum, Nelson, La↵erty, Liu, Gao, et.al. I Numerical OT and Mean field control/MFG: Benamou, Oberman, Osher, Achdou, Yang, Justin, Fu, et.al. I Natural gradient, Mean field control/MFG and Machine learning: Amari, Li, Chen, Montufar, Yang, Liu, Nurbekyan, Chen, Osher, Lars, Zhang, Katsoulakis, Zhou, et.al. 9
  73. Entropy dissipation = Lyapunov methods The gradient flow of the

    negative entropy H(u) = Z Rd u(x)log u(x)dx, w.r.t. optimal transport metric distance satisfies @u @t = r · (urlog u) = u. Here the major trick is that ur log u = ru. In this way, one can study the entropy dissipation by d dtH(u) = Z Rd log ur · (urlog u)dx = Z Rd kr log uk 2udx. 10
  74. Lyapunov method induced calculus Informally speaking, Wasserstein-Otto metric refers to

    the following bilinear form: h ˙ u1, G(u) ˙ u2i = Z ( ˙ u1, ( u ) 1 ˙ u2)dx. In other words, denote u = r · (ur), and ˙ u i = u i = r · (ur i ), i = 1, 2, then h 1, G(u) 1 2i = h 1, r · (ur) 2i = Z (r 1, r 2)udx, where u 2 P(⌦), ˙ u i is the tangent vector in P(⌦) with Z ˙ u i dx = 0, and i 2 C1(⌦) are cotangent vectors in P(⌦) at the point u. Here r·, r are standard divergence and gradient operators in ⌦. 11
  75. Gradient flows The Wasserstein gradient flow of an energy functional

    F(u) leads to @ t u = G(u) 1 uF(u) =r · (ur uF(u)). I If F(u) = R F(x)u(x)dx, then @ t u = r · (urF(x)). I If F(u) = R u(x) log u(x)dx, then @ t u = r · (ur log u) = u. 12
  76. Hamiltonian flows Consider the Lagrangian functional L(u, @ t u)

    = 1 2 Z ⇣ @ t u, ( r · (ur)) 1@ t u ⌘ dx F(u). By the Legendre transform, H(u, ) = sup @tu Z @ t u dx L(u, @ t u). And the Hamiltonian system follows @ t u = H(u, ), @ t = uH(u, ), where u , are L2 first variation operators w.r.t. u, , respectively and the density Hamiltonian forms H(u, ) = 1 2 Z kr k 2udx + F(u). Here u is the “density” state variable and is the “density” moment variable. 13
  77. Examples: Optimal transport and mean field games Thus, the Hamiltonian

    flow satisfies 8 < : @ t u + r · (ur ) = 0 @ t + 1 2kr k 2 = uF(u). This is a well known dynamic, which is studied in potential mean field games, optimal transport and PDE. 14
  78. Optimal transport type formulaisms We are currently working on the

    “universe” of generalized optimal transport and mean field games. I AI theory, inference, and computations; I Quantum computing; I Algorithms of mean-field MCMC algorithms, and numerical PDEs. In this lecture, we point out that there are optimal transport type formalisms to study the control of conservation laws and to design variational numerical schemes. Related works I Godunov (1970); I Lax–Friedrichs (1971); I Brenier (2018); I Osher, Shu, Harten, van Leer: Monotone schemes, TVD schemes, ENO, WENO, et.al. 15
  79. Entropy-Entropy flux pairs Consider @ t u(t, x) + rx

    · f(u) = 0. where u: R + ⇥ Rn ! Rd, and f : Rd ! Rd. Definition (Entropy-entropy flux pair for systems (Lax)) We call (G, ) an entropy-entropy flux pair for the above conservation law system if there exists a convex function G: Rd ! R, and : Rd ! R, such that 0(u) = G0(u)f0(u). And an entropy solution satisfies @ t G(u) + rx · (u)  0. This is trivial for scalar conservation laws, where d = 1. 17
  80. Lyapunov methods: Entropy-Entropy flux Denote G(u) = Z ⌦ G(u)dx.

    In this case, Z ⌦ G0(u)r · f(u)dx = Z ⌦ G0(u)(f0(u), ru)dx = Z ⌦ ( 0(u), ru)dx = Z ⌦ r · (u)dx = 0, where we apply the fact that f0(u)G0(u) = 0(u) and R ⌦ r · (u)dx = 0. 18
  81. Lyapunov methods: Viscosity In addition, Z ⌦ G0(u)r · (A(u)ru)dx

    = Z ⌦ rG0(u), A(u)ru dx = Z ⌦ rG0(u), A(u)G00(u) 1 rG0(u) dx, where we apply rG0(u) = G00(u)ru, in the last equality. We require that A(u)G00(u) 1 ⌫ 0, and A is nonnegative definite, G00(u) > 0. 19
  82. Entropy-Entropy flux-Fisher information dissipation Hence we know that @ tG(u)

    = Z ⌦ G0(u) · @ t udx = Z ⌦ rG0(u), A(u)G00(u) 1 rG0(u) dx  0. This implies that G(u) is a Lyapunov functional for regularized conservation laws. 20
  83. Entropy-Entropy flux-transport metrics Definition (Entropy-entropy flux-metric condition) We call (G,

    ) an entropy-entropy flux pair-metric for conservation laws if there exists a convex function G: Rd ! R, and : Rd ! R, such that 0(u) = G0(u)f0(u), and A(u)G00(u) 1 is symmetric positive semi-definite. 21
  84. Lyapunov methods induced transport metrics Under the entropy-entropy flux-metric condition,

    there is a metric operator for regularized conservation laws. Define the space of function u by M = n u 2 C1(⌦): Z ⌦ u(x)dx = constant o . The tangent space of M(u) at point u satisfies T uM = n 2 C1(⌦): Z ⌦ (x)dx = 0 o . Denote an elliptic operator LC : C1(⌦) ! C1(R) by LC(u) = r · (A(u)G00(u) 1 r). 22
  85. Lyapunov methods induced transport metrics Definition (Metric) The inner product

    g(u): T uM ⇥ T uM ! R is given below. g(u)( 1, 2) = Z ⌦ ( 1, LC(u) 2)dx = Z ⌦ 1r · (A(u)G00(u) 1 r 2)dx = Z ⌦ (r 1, A(u)G00(u) 1 r 2)dx = Z ⌦ 1 2dx = Z ⌦ 2 1dx, where i 2 C1(⌦) satisfies i = r · (A(u)G00(u) 1 r i ), i = 1, 2. 23
  86. Entropy induced Gradient flows Proposition (Gradient flow) Given an energy

    functional E : M ! R, the gradient flow of E in (M, g) satisfies @ t u = r · (A(u)G00(u) 1 r uE(u)). If E(u) = G(u) = R ⌦ G(u)dx, then the above gradient flow satisfies @ t u = r · (A(u)ru). 24
  87. Flux-gradient flows Definition (Flux–gradient flow) Given an energy functional E

    : M ! R, consider a class of PDE @ t u + rx · f1(x, u) = rx · (A(u)G00(u) 1 rx uE(u)), (1) where f1 : ⌦ ⇥ R1 ! Rn is a flux function satisfying Z ⌦ f1(x, u) · rx u(x)E(u)dx = 0. Remark This is the generalized optimal transport for GENERIC (nonequilibrium reversible-irreversible coupling), Grmela and Ottinger, 1997. 25
  88. Example If E(u) = G(u) = R ⌦ G(u)dx and

    f1(x, u) = f(u), then equation (1) forms regularized conservation laws. @ t u + rx · f(u) = rx · (A(u)G00(u) 1 rx uG(u)) = rx · (A(u)G00(u) 1 rx G0(u)) = rx · (A(u)G00(u) 1G00(u)rx u) = rx · (A(u)ru), and Z ⌦ rx uG(u) · f(u)dx = Z ⌦ rx G0(u) · f(u)dx = Z ⌦ G0(u) · ⇣ rx · f(u) ⌘ dx = Z ⌦ r · (u)dx =0. 26
  89. Controlling flux-gradient flows Given smooth functionals F, H: M !

    R, consider a variational problem inf u,v,u1 Z 1 0 h Z ⌦ 1 2 v, A(u)G00(u) 1v dx F(u) i dt + H(u1), where the infimum is taken among variables v: [0, 1] ⇥ ⌦ ! Rn, u: [0, 1] ⇥ ⌦ ! R, and u1 : ⌦ ! R satisfying @ t u + r · f(u) + r · (A(u)G00(u) 1v) = r · (A(u)ru), and u(0, x) = u0(x). 27
  90. Primal-dual conservation laws Proposition (Hamiltonian flows of conservation laws) The

    critical point system of the variational problem is given below. There exists a function : [0, 1] ⇥ ⌦ ! R, such that v(t, x) = r (t, x), and 8 > > < > > : @ t u + r · f(u) + r · (A(u)G00(u) 1 r ) = r · (A(u)ru), @ t + (r , f0(u)) + 1 2 (r , (A(u)G00(u) 1)0 r ) + uF(u) = r · (A(u)r ) + (r , A0(u)ru). (2) Here 0 represents the derivative w.r.t. variable u. The initial and terminal time conditions satisfy u(0, x) = u0(x), u1 H(u1) + (1, x) = 0. 28
  91. Hamiltonian flows of conservation laws Proposition The primal-dual conservation law

    system (2) has the following Hamiltonian flow formulation. @ t u = HG(u, ), @ t = uHG(u, ), where we define the Hamiltonian functional HG : M ⇥ C1(⌦) ! R by HG(u, ) = Z ⌦ h1 2 (r , A(u)G00(u) 1 r ) + (r , f(u)) (r , A(u)ru) i dx + F(u). In other words, the Hamiltonian functional HG(u, ) is conserved along Hamiltonian dynamics: d dtHG(u, ) = 0. 29
  92. Modeling: Controlling tra c flows Position control. The unknown variable

    u in tra c flows represents the density function of cars (particles) in a given spatial domain. Here, the background dynamics of u is the classical tra c flow. The control variable is the velocity for enforcing each car’s velocity in addition to its background tra c flow dynamics. The goal is to control the “total enforced kinetic energy” of all cars, in which individual cars can determine their velocities through both noise and tra c flow interactions. 30
  93. Controlling Tra c flow problems Consider inf u,v,u1 Z 1

    0 h Z ⌦ 1 2kv(t, x)k 2u(t, x)dx F(u) i dt + H(u1), (3a) such that 0  u(t, x)  1, for all t 2 [0, 1], and @ t u(t, x)+r· u(t, x)(1 u(t, x)) +r· u(t, x)v(t, x) = u(t, x), (3b) with u(0, x) = u0(x). 31
  94. Controlling tra c flow dynamics There exists a scalar function

    , such that v(t, x) = r (t, x), and 8 < : @ t u + r · u(1 u) + r · ur = u, @ t + 1 2u, r + 1 2kr k 2 + uF(u) = . 32
  95. The Primal-dual Algorithms The classical primal-dual hybrid gradient algorithms (PDHG)

    solves the following saddle point problem min z max p hKz, piL2 + g(z) h ⇤(p). When the operator K is nonlinear, we apply its extension with K(z) ⇡ K(¯ z) + rK(¯ z)(z ¯ z). Here, the extension of the PDHG scheme is as follows z n+1 = argmin z hz, [rK(z n)]T ¯ p n iL2 + g(z) + 1 2⌧ kz z n k 2 L2 , p n+1 = argmax p hK(z n+1), piL2 h ⇤(p) 1 2 kp p n k 2 L2 , ¯ p n+1 = 2p n+1 p n . Set z = (u, m), p = , K ((u, m)) = @tu + r · f(u) + r · m u, g ((u, m)) = Z 1 0 ✓Z ⌦ kmk 2 2u + 1[0,1] (u)dx F(u) ◆ dt, h(Kz) = ( 0 if Kz = 0 +1 else . 33
  96. Solve the Control Problem with Primal-dual Algorithms We rewrite the

    varational problem as follows: inf u,m sup L(u, m, ), with u(0, x) = u0(x), (1, x) = u(1, x)H(u), L(u, m, ) = Z 1 0 ✓Z ⌦ kmk2 2u + 1[0,1] (u)dx F(u) ◆ dt + Z 1 0 Z ⌦ (@ t u + r · f(u) + r · m u) dxdt, and f(u) = u(1 u) is the tra c flux function. We denote the indicator function by 1A(x) = ( +1 x / 2 A 0 x 2 A . 34
  97. Algorithm Algorithm :PDHG for the conservation law control system While

    k < Maximal number of iteration u(k+1), m(k+1) = argmin u L(u, m, ¯(k)) + 1 2⌧ k(u, m) (u(k), m(k))k2 L2 ; (k+1) = argmax L(u(k+1), m(k+1), ) 1 2 k (k) k2 H2 1 ; ¯(k+1) = 2 (k+1) (k); 35
  98. Monotone schemes Definition For p, q 2 N, a scheme

    u k+1 j = G(u k j p 1, ..., u k j+q ) is called a monotone scheme if G is a monotonically nondecreasing function of each argument. Lax–Friedrichs Scheme. For discretization t, x in time and space, denote u k j = u(k t, j x), then the Lax–Friedrichs scheme is as follows u k+1 j = u k j t 2 x ⇣ f(u k j+1 ) f(u k j 1 ) ⌘ +( +c x) t ( x)2 ⇣ u k j+1 2u k j + u k j 1 ⌘ . To guarantee that the above scheme is monotone, we need: 1 2( + c x) t ( x)2 0, t 2 x |f 0(u)| + ( + c x) t ( x)2 0. As we want the scheme works when ! 0, the restriction on c and space–time stepsizes can be simplified as follows: c 1 2 |f 0(u)|, ( x)2 2( + c x) t. The first inequality suggests the artificial viscosity we need to add. The second one impose a strong restriction on the stepsize in time when > 0. 36
  99. Discretization of the control problem We consider the control problem

    of scalar conservation law defined in [0, b] ⇥ [0, 1] with periodic boundary condition on the spatial domain. Given N x , N t > 0, we have x = b Nx , t = 1 Nt . For x i = i x, t l = l t, define u l i = u(tl, xi) 1  i  Nx, 0  l  Nt, m l 1,i = (mx1 (tl, xi))+ 1  i  Nx, 0  l  Nt 1, m l 2,i = (mx1 (tl, xi)) 1  i  Nx, 0  l  Nt 1, l i = (tl, xi) 1  i  Nx, 0  l  Nt, Nt i = u(1, xi) H(u Nt i ) 1  i  Nx, where u + := max(u, 0) and u = u + u. Note here m l 1,i 2 R+, m l 2,i 2 R . Denote (Du)i := ui+1 ui x [Du]i := ⇣ (Du)i , (Du)i 1 ⌘ d [Du]i = ⇣ (Du)+ i , (Du)i 1 ⌘ Lap(u)i = ui+1 2ui + ui 1 ( x)2 37
  100. Discretization of the control problem The first conservation law equation

    adapted from the Lax–Friedrichs scheme satisfies ⇣ u l+1 i u l i ⌘ t + ⇣ f(u l+1 i+1 ) f(u l+1 i 1 ) ⌘ 2 x +(Dm)l 1,i 1 +(Dm)l 2,i = ( +c x)Lap(u)l+1 i . Following the discretization of the conservation law, the discrete saddle point problem has the following form: min u,m max L(u, m, ), where L(u, m, ) = x t X i,l (m l 1 1,i )2 + (m l 1 2,i )2 2u l i + 1[0,1] (u l i ) t X l F(u l) + x X i H(u Nt i ) + x t X i,l l i ⇣ u l+1 i u l i t + f(u l+1 i+1 ) f(u l+1 i 1 ) 2 x + (Dm)l 1,i 1 + (Dm)l 2,i ( + c x)Lap(u)l+1 i ⌘ . 38
  101. Discretization of the control problem By taking the first order

    derivative of ul i , we automatically get the implicit finite di↵erence scheme for the dual equation of that is backward in time: 1 t ⇣ l+1 i l i ⌘ + ( l i+1 l i 1 ) 2 x f 0(u l i )+ 1 2 k[d D ]l ik 2+ F(u l i ) u = ( +c x)Lap( )l i. The discrete form of the Hamiltonian functional at t = t l takes the form HG(u, ) = X i 1 2 k[d D ]l ik 2 u l i + ( l i+1 l i 1 ) 2 x f(u l i ) + ( + c x)u l iLap( )l i ! F(u l). I When f(u) = 0, the above discretization reduces to the finite di↵erence scheme for the mean-field game system. I When F = 0, H = c for some constant c, the variational problem becomes classical conservation laws with initial data. No control will be enforced on the density function u. I Our algorithm provides an alternative way to solve the nonlinear conservation law with implicit discretization in time. 39
  102. Example 1: Tra c flows We consider the tra c

    flow equation @tu + @xf(u) = 0, u(0, x) = ( 0.8, 1  x  2 0 else , where f(u) = 1 2 u(1 u), F = 0, H = 0, = 0, k = 0.5. Figure: Left: a comparison with the exact entropy solution at t = 1; middle: the numerical solution via solving the control problem; right: the numerical solution to the conservation law using Lax–Friedrichs scheme. 40
  103. Example 2: Tra c flows transport problems We consider the

    tra c flow equation with f(u) = u(1 u), = 0.1, F = 0, c = 0.5. The final cost functional H(u(1, ·)) = µ Z ⌦ u(1, x) log( u(1, x) u1 )dx, µ = 1. We set u0 = 0.001 + 0.9e 10(x 2)2 , u1 = 0.001 + 0.45e 10(x 1)2 + 0.45e 10(x 3)2 . We also compare the result from the control of conservation law with a mean-field game problem, i.e., f = 0. 41
  104. Example 2: Tra c flows transport problems 0 2 4

    x 0 0.5 1 0 2 4 x 0 0.2 0.4 0.6 0 2 4 x 0 0.2 0.4 0.6 0 2 4 x 0 0.2 0.4 0.6 Figure: From left to right: initial configurations of u0 , u1 , solution u(1, x) for the control of conservation law, solution u(1, x) for the mean-field game problem. Figure: Left: solution u(t, x) for the problem of controlling the conservation law; right: solution u(t, x) for the mean-field game problem. 42
  105. Example 3: Tra c control Consider the tra c flow

    equation with f(u) = u(1 u), = 10 3,F = ↵ R ⌦ u log(u)dx, ↵ 0. The final cost functional H(u(1, ·)) = R ⌦ u(1, x)g(x)dx. The initial density and final cost function are as follows u0(x) = ( 0.4 0.5  x  1.5 10 3 else , g(x) = 0.1 sin(2⇡x). Figure: Solution u(t, x) for the problem of controlling the conservation law. From left to right: ↵ = 0, 0.5, 1. 43
  106. Example 3: Tra c control 0 2 4 x -0.2

    0 0.2 0.4 u 0 g 0 2 4 x 0 0.5 1 =0 =0.5 =1 0 0.5 1 t 0 0.5 1 =0 =0.5 =1 Figure: Left: boundary conditions for the control problems u0 . Middle: solution u(1, x) for the problem of controlling the conservation law. Right: the numerical Hamiltonian HG(u, ). Figure: Solution u(t, x) for the problem of controlling the conservation law. From left to right: ↵ = 0, 0.5, 1. 44
  107. Extending to systems of conservation law Figure: Control 1D compressible

    Navier-Stokes equations. Top: no control. Bottom: with control. 45