Guillaume Carlier (Université Paris Dauphine - PSL, France) Displacement Smoothness of Entropic Optimal Transport and Applications to some Evolution Equations and Systems
WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24
some evolution equations
Guillaume Carlier (CEREMADE, Université Paris Dauphine and MOKAPLAN (Inria-Dauphine))
Based on joint works with Lénaïc Chizat and Maxime Laborde (2022) and Hugo Malamut (in progress).
Workshop on Optimal Transport from Theory to Applications, Interfacing Dynamical Systems, Optimization, and Machine Learning, Berlin, March 2024.
Given $X_1$ and $X_2$ convex compact subsets of $\mathbb{R}^d$ and $\mu_1$, $\mu_2$ compactly supported probability measures on $X_1$ and $X_2$ respectively, the optimal transport problem of Monge and Kantorovich reads
\[ \inf_{\gamma\in\Pi(\mu_1,\mu_2)} \int_{X_1\times X_2} c(x_1,x_2)\,\gamma(dx_1,dx_2) \tag{1} \]
where $\Pi(\mu_1,\mu_2)$ is the set of probability measures on $X := X_1\times X_2$ having $\mu_1$ and $\mu_2$ as marginals.
The entropic regularization
\[ \inf_{\gamma\in\Pi(\mu_1,\mu_2)} \int_{X_1\times X_2} c(x_1,x_2)\,\gamma(dx_1,dx_2) + \varepsilon H(\gamma|\mu_1\otimes\mu_2), \]
where $H$ stands for relative entropy, is known to be much more tractable (uniqueness, regularity, efficient computation by the Sinkhorn algorithm...), and precise convergence results to the initial OT problem are by now quite well understood. Success of Sinkhorn's algorithm (Cuturi, Peyré), connections with large deviations, stochastic control, Schrödinger bridges (Léonard, Mikami, Dawson and Gärtner...).
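The optimal entropic plan is of the form
\[ \gamma^\varepsilon(dx_1,dx_2) = \exp\Big(\frac{\varphi_1(x_1)+\varphi_2(x_2)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_1(dx_1)\otimes\mu_2(dx_2), \]
where the (so-called Schrödinger) potentials $\varphi_1$ and $\varphi_2$ are implicitly determined by the marginal constraints $\gamma^\varepsilon\in\Pi(\mu_1,\mu_2)$.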
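For discrete marginals, the potentials in this Gibbs form can be approximated by Sinkhorn iterations. The block below is a minimal sketch under stated assumptions, not code from the talk: the names `sinkhorn_potentials` and `_logsumexp`, the fixed iteration count, and the restriction to finitely supported measures with positive weights are all choices made here for illustration. It works in the log domain and uses the convention plan = exp((phi1 + phi2 - c)/eps) times the product measure.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_potentials(mu1, mu2, cost, eps, n_iter=500):
    """Illustrative log-domain Sinkhorn (hypothetical helper).

    mu1: (n,) positive weights, mu2: (m,) positive weights,
    cost: (n, m) matrix c(x1_i, x2_j), eps: regularization strength.
    Returns (phi1, phi2, plan) with
    plan_ij = exp((phi1_i + phi2_j - cost_ij) / eps) * mu1_i * mu2_j.
    """
    log_mu1, log_mu2 = np.log(mu1), np.log(mu2)
    phi1 = np.zeros(len(mu1))
    phi2 = np.zeros(len(mu2))
    for _ in range(n_iter):
        # Enforce the first marginal: the row sums of the plan equal mu1.
        phi1 = -eps * _logsumexp((phi2[None, :] - cost) / eps + log_mu2[None, :], axis=1)
        # Enforce the second marginal: the column sums of the plan equal mu2.
        phi2 = -eps * _logsumexp((phi1[:, None] - cost) / eps + log_mu1[:, None], axis=0)
    plan = np.exp((phi1[:, None] + phi2[None, :] - cost) / eps) * np.outer(mu1, mu2)
    return phi1, phi2, plan
```

The potentials are only defined up to adding a constant to `phi1` and subtracting it from `phi2`, which leaves the plan unchanged; the sketch does not impose any normalization.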
c has a $C^k$ extension over $\mathbb{R}^d\times\mathbb{R}^d$), quotient $C^k(X_1)\times C^k(X_2)$ by the equivalence relation $\varphi\sim\psi$ if there is a constant $\lambda$ such that $\varphi_1=\psi_1+\lambda$ and $\varphi_2=\psi_2-\lambda$, and then view $T(\cdot,\mu)$ as a self map of (the Banach space) $C^k = C^k(X_1)\times C^k(X_2)/\sim$.
Existence of a solution $\varphi\in C^k$ fitting the marginal constraints, i.e. such that $T(\varphi,\mu)=0$, was established (by variational or fixed point arguments, using the Hilbert metric) by Borwein, Lewis and Nussbaum (1994). A more elementary proof, with an extension to the multi-marginal case using the Sinkhorn algorithm, is due to Gerolin and Di Marino (2019). Local/global inversion arguments by C. and Laborde (2020): smooth dependence for $L^\infty$ perturbations of the marginals. Quantitative convergence of Sinkhorn is still an active area (bounded costs and multi-marginals: C.; unbounded costs: Léger, Nutz, Eckstein...).
Denote by $S(\mu)\in C^k$ this solution: the Schrödinger map, which maps marginals to the Schrödinger potentials (with a suitable normalization or, equivalently, by quotienting by $\sim$).
Displacement smoothness of EOT
the Schrödinger potentials between $\mu_1^t$ and $\mu_2^t$, we then have

Theorem 1 (C., Chizat, Laborde, 2022). For $p,k\in\mathbb{N}^*$, $p\le k$, if $c\in C^{k+p}(X_1\times X_2)$ then the parametrized Schrödinger map $t\mapsto\varphi^t=S(\mu^t)$ belongs to $C^p([0,1];C^k)$. Moreover, there exists $C>0$ that only depends on $\|c\|_{C^{k+1}}$ such that
\[ \|\varphi^t-\varphi^s\|_{C^k}\le C\,|t-s|\,\sqrt{\mathrm{cost}(\gamma_1,\gamma_2)} \]
where
\[ \mathrm{cost}(\gamma_1,\gamma_2) := \sum_{i=1}^{2}\int |x_i-y_i|^2\,d\gamma_i(x_i,y_i) \]
is the $L^2$ transport cost associated with the plans $\gamma_1$, $\gamma_2$.
if $c\in C^{k+1}$ then for some $C>0$ one has
\[ \|S(\mu)-S(\nu)\|_{C^k}\le C\,W_2(\mu,\nu). \]
We actually prove the previous theorem in the more general multi-marginal case (mainly at the expense of cumbersome notation).
Sketch of proof of Thm 1, with $G(t,\varphi) := T(\varphi,\mu^t)$: essentially the Implicit Function Theorem for $G$. First step: invertibility of $\partial_\varphi G$.
\[ \int_{X_1\times X_2}\big(\psi_1(x_1)+\psi_2(x_2)\big)^2\,dQ(x_1,x_2)=0, \]
which in the end implies $(\psi_1,\psi_2)\sim 0$. This shows that $\partial_\varphi G$ is one-to-one; by the Fredholm alternative one obtains invertibility. One then has to work a bit to bound the operator norm (in $C^k$) of its inverse at $\varphi^t$, and to bound the derivatives of $G$ with respect to $t$ in terms of $\mathrm{cost}(\gamma)$.
Application to gradient flows
For $V$ a functional of probability measures, since the seminal work of Jordan, Kinderlehrer and Otto, the evolution equation
\[ \partial_t\rho = \mathrm{div}\Big(\rho\,\nabla\frac{\delta V}{\delta\rho}(\rho)\Big),\qquad \rho(0,\cdot)=\rho_0 \]
(with no flux boundary conditions) can be seen as the gradient flow of $V$ for the 2-Wasserstein metric. Ambrosio-Gigli-Savaré's green book: the key to well-posedness is the displacement semi-convexity of $V$:
\[ V(\mu_t)\le (1-t)V(\mu)+tV(\nu)+\frac{\lambda}{2}\,t(1-t)\,W_2^2(\mu,\nu) \]
where $\mu_t$, $t\in[0,1]$, is the displacement interpolation (via an optimal plan) between $\mu$ and $\nu$.
For
\[ E(\mu_1,\mu_2) := \inf_{\gamma\in\Pi(\mu_1,\mu_2)} \int c\,d\gamma + H(\gamma|\mu_1\otimes\mu_2), \]
if $c$ is $C^2$ and $\mu^t$ is a (not necessarily optimal) displacement interpolation between $\mu^0$ and $\mu^1$ using the plans $\gamma=(\gamma_1,\gamma_2)$, we deduce from Theorem 1 that $E$ is "displacement $C^{1,1}$", i.e. semi-convex and semi-concave:
\[ (1-t)E(\mu^0)+tE(\mu^1)-C\,\mathrm{cost}(\gamma)\,\frac{t(1-t)}{2}\le E(\mu^t)\le (1-t)E(\mu^0)+tE(\mu^1)+C\,\mathrm{cost}(\gamma)\,\frac{t(1-t)}{2} \]
for some $C>0$ depending on $\|c\|_{C^2}$, and the Schrödinger map $S(\mu)$ is the gradient of $E$.
Entropic semi-geostrophic equations
Joint work with H. Malamut. The semi-geostrophic equations are a simple model used in meteorology to describe large-scale atmospheric flows. They go back to the work of Eliassen in 1948 and Hoskins in 1975, and were revived from the 1980s on by the works of Mike Cullen. Lots of interest in the last 25 years due to connections with OT and Monge-Ampère equations: Benamou and Brenier 1998, Gangbo and Cullen 2001, Loeper 2005, Ambrosio, Colombo, De Philippis and Figalli 2014.
subset of $\mathbb{R}^3$, and $\alpha_0$ a Borel measure on $\mathbb{R}^3$ with total mass $|\Omega|$, the semi-geostrophic system reads as the coupling of
\[ \partial_t\alpha+\mathrm{div}\big(\alpha J(\mathrm{id}-\nabla\psi)\big)=0,\qquad \alpha(0,\cdot)=\alpha_0,\qquad J:=\begin{pmatrix}0&-1&0\\ 1&0&0\\ 0&0&0\end{pmatrix} \tag{2} \]
with the Monge-Ampère equation
\[ \det(D^2\psi_t)=\alpha_t,\qquad \psi_t \text{ convex}, \tag{3} \]
which has to be understood in a suitable weak sense using optimal transport: by Brenier's theorem, $\nabla\psi_t$ is the quadratic OT map between $\alpha_t$ and the uniform measure $\mu_0$ on $\Omega$. Existence was shown by Benamou and Brenier.
\[ \partial_t\alpha+\mathrm{div}\big(\alpha A(\mathrm{id}-\nabla\psi)\big)=0,\qquad \alpha(0,\cdot)=\alpha_0,\qquad \det(D^2\psi_t)\,\mu_0(\nabla\psi_t)=\alpha_t, \tag{4} \]
with $\psi$ convex. For $d=3$, $\mu_0$ the uniform probability measure on $\Omega$ and $A=J$, we recover the initial problem (2)-(3) after normalizing all measures by dividing them by $|\Omega|$.
Idea: view $\nabla u=\mathrm{id}-\nabla\psi$ as the conditional expectation of $x-y$ given $x$ for an optimal transport plan $\gamma$.
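$t\mapsto\alpha_t$ such that for every $f\in C^1_c([0,T]\times\mathbb{R}^d)$, one has
\[ \int_0^T\!\!\int_{\mathbb{R}^d}\big[\partial_t f+Ax\cdot\nabla f\big]\,\alpha_t(dx)\,dt-\int_0^T\!\!\int_{\mathbb{R}^d\times\mathbb{R}^d}Ay\cdot\nabla f(t,x)\,\gamma_t(dx,dy)\,dt=\int_{\mathbb{R}^d}f(T,x)\,\alpha_T(dx)-\int_{\mathbb{R}^d}f(0,x)\,\alpha_0(dx), \tag{5} \]
where $\gamma_t$ is an optimal plan between $\alpha_t$ and $\mu_0$, i.e. $\gamma_t\in\Pi(\alpha_t,\mu_0)$ and
\[ W_2^2(\alpha_t,\mu_0)=\int_{\mathbb{R}^d\times\mathbb{R}^d}|x-y|^2\,\gamma_t(dx,dy)\quad\text{for a.e. } t\in[0,T]. \tag{6} \]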
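For $\varepsilon>0$, the idea is to replace optimal plans $\gamma$ for $W_2$ by optimal entropic plans $\gamma^\varepsilon$. For $\alpha$ and $\mu$ compactly supported, consider
\[ \mathrm{OT}_\varepsilon(\alpha,\mu) := \inf_{\gamma\in\Pi(\alpha,\mu)}\frac12\int_{\mathbb{R}^d\times\mathbb{R}^d}|x-y|^2\,\gamma(dx,dy)+\varepsilon H(\gamma|\alpha\otimes\mu). \tag{7} \]
The unique optimal plan $\gamma^\varepsilon$ for $\mathrm{OT}_\varepsilon(\alpha,\mu)$ has the Gibbs form
\[ \gamma^\varepsilon(dx,dy)=\exp\Big(-\frac{|x-y|^2}{2\varepsilon}+\frac{u^\varepsilon(y)+v^\varepsilon(x)}{\varepsilon}\Big)\,\alpha(dx)\,\mu(dy) \tag{8} \]
where the potentials $u^\varepsilon$ and $v^\varepsilon$ are such that $\gamma^\varepsilon\in\Pi(\alpha,\mu)$, i.e. satisfy the Schrödinger system.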
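Entropic approximation, with $\varepsilon>0$, of (4):
\[ \partial_t\alpha^\varepsilon+\mathrm{div}\big(\alpha^\varepsilon A(\nabla v^\varepsilon)\big)=0 \text{ in } [0,T]\times\mathbb{R}^d,\qquad \alpha^\varepsilon(0,\cdot)=\alpha_0, \tag{12} \]
where
\[ \nabla v^\varepsilon_t(x)=x-\int_{\mathbb{R}^d}y\,\gamma^\varepsilon_t(dy|x) \tag{13} \]
and $\gamma^\varepsilon_t$ is the solution of $\mathrm{OT}_\varepsilon(\alpha^\varepsilon_t,\mu_0)$, i.e. $\gamma^\varepsilon_t\in\Pi(\alpha^\varepsilon_t,\mu_0)$ and
\[ \mathrm{OT}_\varepsilon(\alpha^\varepsilon_t,\mu_0)=\frac12\int_{\mathbb{R}^d\times\mathbb{R}^d}|x-y|^2\,\gamma^\varepsilon_t(dx,dy)+\varepsilon H(\gamma^\varepsilon_t|\alpha^\varepsilon_t\otimes\mu_0). \tag{14} \]
It follows from Theorem 1 that (the smooth map) $\nabla v^\varepsilon$ depends in a $W_2$-Lipschitz way on $\alpha^\varepsilon$.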
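For finitely supported measures, formula (13) is a barycentric projection of the plan. The following is a minimal sketch assuming the plan is already available as a matrix; the function names `grad_v` and `drift` and the array layout are hypothetical choices made here, not notation from the talk.

```python
import numpy as np

def grad_v(x, y, plan):
    """Discrete analogue of (13): grad v(x_i) = x_i - E[y | x_i].

    x: (n, d) support of alpha, y: (m, d) support of mu_0,
    plan: (n, m) nonnegative matrix with positive row sums.
    """
    alpha = plan.sum(axis=1)            # first marginal weights
    bary = (plan @ y) / alpha[:, None]  # conditional expectation of y given x_i
    return x - bary

def drift(x, y, plan, A):
    """Velocity field A grad v evaluated at each support point x_i."""
    return grad_v(x, y, plan) @ A.T
```

With `A` equal to the matrix $J$ of (2), the third component of the drift vanishes identically, so the vertical coordinate of each point is left unchanged by the flow.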
Remarks on continuity equations
For $\alpha_0\in\mathcal{P}_c(\mathbb{R}^d)$, solving/approximating (15) is quite straightforward: standard Cauchy-Lipschitz framework. Indeed, rewrite (15) as the fixed-point problem $\alpha=\Phi_{\alpha_0}(\alpha)$ with
\[ \Phi_{\alpha_0}(\alpha)_t := X^\alpha_t\,\#\,\alpha_0,\qquad t\in[0,T], \]
where $X^\alpha_t$ is the (globally well-defined) flow of $B[\alpha]$:
\[ \frac{d}{dt}X^\alpha_t(x)=B[\alpha_t](X^\alpha_t(x)),\qquad X^\alpha_0(x)=x,\qquad (t,x)\in[0,T]\times\mathbb{R}^d. \tag{19} \]
For well-chosen $\lambda>0$, $\Phi_{\alpha_0}$ is a contraction for the distance
\[ \mathrm{dist}(\alpha^1,\alpha^2)=\sup_{t\in[0,T]}e^{-\lambda t}\,W_2(\alpha^1_t,\alpha^2_t). \]
Existence, uniqueness, Lipschitz dependence with respect to the initial condition.
the drift $B^\varepsilon$, i.e. does it satisfy (H1)-(H2)-(H3)? The linear growth condition (H1) holds with a constant independent of $\varepsilon$: it follows from (10) and the fact that $\mathrm{spt}\,\gamma^\varepsilon(\cdot|x)$ lies in $B_{R_0}$. The Lipschitz (in $x$) requirement (H2) follows from (11) with a constant $K\sim R_0^2\,\varepsilon^{-1}$. We deduce from the displacement smoothness result that $B^\varepsilon$ satisfies (H3) (but with a very bad constant $M\sim e^{-A\varepsilon^{-1}}$).
\[ \partial_t\alpha^\varepsilon+\mathrm{div}\big(\alpha^\varepsilon B^\varepsilon[\alpha^\varepsilon]\big)=0,\qquad \alpha^\varepsilon(0,\cdot)=\alpha_0, \]
with $B^\varepsilon[\alpha]=J(\nabla v^\varepsilon)$, $v^\varepsilon$ the Schrödinger potential between $\alpha$ and $\mu_0$. By our general considerations on nice continuity equations, we deduce

Theorem 2. For $\varepsilon>0$, (12)-(13)-(14) admits a unique solution $\alpha^\varepsilon$.

Note that we have not used the Hamiltonian structure of SG$_\varepsilon$ (SG$_\varepsilon$ enters the Hamiltonian framework of Ambrosio and Gangbo). Conservation of energy is easy to see: $\mathrm{OT}_\varepsilon(\alpha^\varepsilon_t,\mu_0)$ is constant in time (and the vertical marginal of $\alpha^\varepsilon$ is of course constant as well).
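Convergence as $\varepsilon\to 0$
It is not difficult to show that cluster points of solutions of SG$_\varepsilon$ solve SG, but this is of little practical use. Time and space discretization by an explicit Euler scheme, proposed by Benamou, Cotter and Malamut. Consider a time step $\tau_\varepsilon>0$ with $T=N_\varepsilon\tau_\varepsilon$ and a quantized approximation of the initial $\alpha_0\in\mathcal{P}(B_{R_0})$,
\[ \alpha^\varepsilon_0 := \frac{1}{M_\varepsilon}\sum_{i=1}^{M_\varepsilon}\delta_{x^\varepsilon_i},\qquad x^\varepsilon_i\in B_{R_0}, \]
and assume that
\[ \tau_\varepsilon+W_2(\alpha_0,\alpha^\varepsilon_0)\to 0\quad\text{as } \varepsilon\to 0^+. \tag{20} \]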
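Define the curve of measures $t\in[0,T]\mapsto\alpha^\varepsilon_t$ by the explicit Euler scheme, i.e.
\[ \alpha^\varepsilon_t=\alpha^\varepsilon_k,\qquad t\in[k\tau_\varepsilon,(k+1)\tau_\varepsilon),\quad k=0,\dots,N_\varepsilon-1, \]
started from the quantized initial measure $\alpha^\varepsilon_0$, with
\[ \alpha^\varepsilon_{k+1}=\big(\mathrm{id}+\tau_\varepsilon B^\varepsilon[\alpha^\varepsilon_k]\big)_\#\,\alpha^\varepsilon_k,\qquad k=0,\dots,N_\varepsilon-1, \]
and $B^\varepsilon$ defined through $\mathrm{OT}_\varepsilon(\alpha,\mu_0)$ as before. One can also quantize $\mu_0$; computation of $B^\varepsilon[\alpha^\varepsilon_k]$ by Sinkhorn.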
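The scheme can be sketched for particles as follows. This is an illustrative sketch under stated assumptions, not the implementation of Benamou, Cotter and Malamut: both measures are taken uniform on finitely many points, the Sinkhorn iteration count is fixed, and all function names (`entropic_plan`, `euler_step`, `euler_scheme`) are hypothetical.

```python
import numpy as np

# Rotation matrix J from (2): acts on the horizontal components only.
J = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])

def _lse(a, axis):
    """Stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def entropic_plan(x, y, eps, n_iter=200):
    """Sinkhorn plan between uniform measures on x (n, d) and y (m, d),
    for the cost |x - y|^2 / 2 (assumed discretization of OT_eps)."""
    n, m = len(x), len(y)
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    v = np.zeros(n)  # potential on the particles x
    u = np.zeros(m)  # potential on the quantized mu_0
    for _ in range(n_iter):
        v = -eps * _lse((u[None, :] - cost) / eps - np.log(m), axis=1)
        u = -eps * _lse((v[:, None] - cost) / eps - np.log(n), axis=0)
    return np.exp((v[:, None] + u[None, :] - cost) / eps) / (n * m)

def euler_step(x, y, eps, tau):
    """One explicit Euler step: x_i <- x_i + tau * J (x_i - E[y | x_i])."""
    plan = entropic_plan(x, y, eps)
    bary = (plan @ y) / plan.sum(axis=1, keepdims=True)
    return x + tau * (x - bary) @ J.T

def euler_scheme(x0, y, eps, tau, n_steps):
    """Return the list of particle positions at times 0, tau, ..., n_steps * tau."""
    x = x0.copy()
    traj = [x]
    for _ in range(n_steps):
        x = euler_step(x, y, eps, tau)
        traj.append(x)
    return traj
```

Because the last row of `J` is zero, the third coordinate of every particle is preserved exactly, which mirrors the conservation of the vertical marginal in the continuous model.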
Since $W_2(\alpha^\varepsilon_t,\alpha^\varepsilon_s)\le\kappa(|t-s|+\tau_\varepsilon)$ for every $t,s$ in $[0,T]$ and a constant $\kappa$ independent of $\varepsilon$, passing along a suitable vanishing sequence $\varepsilon_n\to 0$, we may assume that for some $\alpha=(\alpha_t)_{t\in[0,T]}\in C([0,T],(\mathcal{P}(B_{R_T}),W_2))$ (with $R_T:=2R_0e^T$) one has
\[ \sup_{t\in[0,T)}W_2(\alpha^\varepsilon_t,\alpha_t)\to 0\quad\text{as } \varepsilon\to 0^+. \tag{21} \]
Cluster points of the previous approximations are weak solutions of the initial semi-geostrophic equations:

Theorem 3. If $\alpha$ is obtained as a cluster point of the discretized entropic regularization $(\alpha^\varepsilon_t)_{t\in[0,T)}$, i.e. (21) holds, then $\alpha$ is a weak solution of (2)-(3).
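With the curve of plans $t\mapsto\gamma^\varepsilon_t$ defined by $\gamma^\varepsilon_t=\gamma^\varepsilon_k$ for $t\in[k\tau_\varepsilon,(k+1)\tau_\varepsilon)$, recalling that $N_\varepsilon\tau_\varepsilon=T$, we thus have
\[ \int_0^T\!\!\int_{\mathbb{R}^3}\partial_t f(t,x)\,\alpha^\varepsilon_t(dx)\,dt=\int_0^T\!\!\int_{\mathbb{R}^6}\nabla f(t,x)\cdot J(y-x)\,\gamma^\varepsilon_t(dx,dy)\,dt+\int_{\mathbb{R}^3}f(T,x)\,\alpha_T(dx)-\int_{\mathbb{R}^3}f(0,x)\,\alpha_0(dx)+o(1). \tag{23} \]
Assume (possibly after an extraction) that $\gamma^\varepsilon_t(dx,dy)\otimes dt$ converges weakly-$*$ as $\varepsilon\to 0^+$ to some measure of the form $\gamma_t(dx,dy)\otimes dt$.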
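\[ \int_0^T\!\!\int_{\mathbb{R}^3}\partial_t f(t,x)\,\alpha_t(dx)\,dt=\int_0^T\!\!\int_{\mathbb{R}^6}\nabla f(t,x)\cdot J(y-x)\,\gamma_t(dx,dy)\,dt+\int_{\mathbb{R}^3}f(T,x)\,\alpha_T(dx)-\int_{\mathbb{R}^3}f(0,x)\,\alpha_0(dx), \]
and $\gamma_t\in\Pi(\alpha_t,\mu_0)$ for a.e. $t$. So, to show that $\alpha$ is a weak solution of SG, it remains to check that $\gamma_t$ is an optimal plan.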
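\[ \mathrm{OT}_\varepsilon(\alpha^\varepsilon_t,\mu_0)\le\frac12W_2^2(\alpha^\varepsilon_t,\mu_0)+\kappa'\varepsilon|\log(\varepsilon)| \]
for some constant $\kappa'$ that does not depend on $\varepsilon>0$ and $t\in[0,T]$. We thus have
\[ \int_0^T\!\!\int_{\mathbb{R}^3\times\mathbb{R}^3}\frac12|x-y|^2\,\gamma_t(dx,dy)\,dt=\lim_\varepsilon\int_0^T\!\!\int_{\mathbb{R}^3\times\mathbb{R}^3}\frac12|x-y|^2\,\gamma^\varepsilon_t(dx,dy)\,dt\le\limsup_\varepsilon\int_0^T\mathrm{OT}_\varepsilon(\alpha^\varepsilon_t,\mu_0)\,dt\le\limsup_\varepsilon\int_0^T\frac12W_2^2(\alpha^\varepsilon_t,\mu_0)\,dt=\int_0^T\frac12W_2^2(\alpha_t,\mu_0)\,dt, \]
which shows optimality of $\gamma_t$ for a.e. $t$.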