Guillaume Carlier (Université Paris Dauphine - PSL, France) Displacement Smoothness of Entropic Optimal Transport and Applications to some Evolution Equations and Systems

Slide 1

Slide 1 text

Displacement smoothness of entropic optimal transport and applications to some evolution equations. Guillaume Carlier (CEREMADE, Université Paris Dauphine and MOKAPLAN (Inria-Dauphine)). Based on joint works with Lénaïc Chizat and Maxime Laborde (2022) and Hugo Malamut (in progress). Workshop on Optimal Transport from Theory to Applications, Interfacing Dynamical Systems, Optimization, and Machine Learning, Berlin, March 2024.

Slide 2

Slide 2 text

Introduction
Given $c \in C(\mathbb{R}^d \times \mathbb{R}^d)$, $X_1$ and $X_2$ convex compact subsets of $\mathbb{R}^d$, and $\mu_1$, $\mu_2$ compactly supported probability measures on $X_1$ and $X_2$ respectively, the optimal transport problem of Monge and Kantorovich reads
$$\inf_{\gamma \in \Pi(\mu_1,\mu_2)} \int_{X_1 \times X_2} c(x_1,x_2)\,\gamma(dx_1,dx_2) \tag{1}$$
where $\Pi(\mu_1,\mu_2)$ is the set of probability measures on $X := X_1 \times X_2$ having $\mu_1$ and $\mu_2$ as marginals.

Slide 3

Slide 3 text

Its entropic regularization
$$\inf_{\gamma \in \Pi(\mu_1,\mu_2)} \int_{X_1 \times X_2} c(x_1,x_2)\,\gamma(dx_1,dx_2) + \varepsilon H(\gamma|\mu_1 \otimes \mu_2),$$
where $H$ stands for relative entropy, is known to be much more tractable (uniqueness, regularity, efficient computation by the Sinkhorn algorithm...), and precise convergence results to the initial OT problem are by now quite well understood. Success of Sinkhorn's algorithm (Cuturi, Peyré), connection with large deviations, stochastic control, Schrödinger bridges (Léonard, Mikami, Dawson and Gärtner...).

Slide 4

Slide 4 text

It is well known that the unique optimal entropic plan is of the form
$$\gamma_\varepsilon(dx_1,dx_2) = \exp\Big(\frac{\phi_1(x_1)+\phi_2(x_2)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_1(dx_1) \otimes \mu_2(dx_2)$$
where the (so-called Schrödinger) potentials $\phi_1$ and $\phi_2$ are implicitly determined by the marginal constraints $\gamma_\varepsilon \in \Pi(\mu_1,\mu_2)$.

Slide 5

Slide 5 text

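The marginal constraints lead to the following system (Schrödinger system) for the potentials:
$$1 = \int_{X_2} \exp\Big(\frac{\phi_1(x_1)+\phi_2(x_2)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_2(dx_2) \quad \text{for } \mu_1\text{-a.e. } x_1,$$
$$1 = \int_{X_1} \exp\Big(\frac{\phi_1(x_1)+\phi_2(x_2)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_1(dx_1) \quad \text{for } \mu_2\text{-a.e. } x_2.$$
Note that this system is invariant under $(\phi_1,\phi_2) \mapsto (\phi_1+\lambda, \phi_2-\lambda)$.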

Slide 6

Slide 6 text

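Sinkhorn (aka Iterated Proportional Fitting Procedure or... Gauss-Seidel) iterations:
$$\phi_1^{t+1}(x_1) = -\varepsilon \log \int_{X_2} \exp\Big(\frac{\phi_2^{t}(x_2)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_2(dx_2)$$
and
$$\phi_2^{t+1}(x_2) = -\varepsilon \log \int_{X_1} \exp\Big(\frac{\phi_1^{t+1}(x_1)-c(x_1,x_2)}{\varepsilon}\Big)\,\mu_1(dx_1).$$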
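The two updates above can be sketched for finitely supported marginals as follows. This is only an illustration under assumptions that are not in the talk: the measures are discrete with strictly positive weights, the function names (`logsumexp`, `sinkhorn`, `entropic_plan`) and the toy data are hypothetical, and the integrals are replaced by weighted sums computed in the log domain.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sinkhorn(cost, mu1, mu2, eps=1.0, iters=200):
    """Alternate the two potential updates (one Gauss-Seidel sweep per iteration).

    cost[i][j] = c(x1_i, x2_j); mu1, mu2 are lists of positive weights summing to 1.
    """
    n, m = len(mu1), len(mu2)
    phi1, phi2 = [0.0] * n, [0.0] * m
    for _ in range(iters):
        # phi1_i <- -eps * log sum_j exp((phi2_j - c_ij) / eps) * mu2_j
        phi1 = [-eps * logsumexp([(phi2[j] - cost[i][j]) / eps + math.log(mu2[j])
                                  for j in range(m)]) for i in range(n)]
        # phi2_j uses the freshly updated phi1, as in the second iteration above
        phi2 = [-eps * logsumexp([(phi1[i] - cost[i][j]) / eps + math.log(mu1[i])
                                  for i in range(n)]) for j in range(m)]
    return phi1, phi2

def entropic_plan(cost, mu1, mu2, phi1, phi2, eps=1.0):
    """Gibbs-form plan exp((phi1 + phi2 - c) / eps) * mu1 (x) mu2."""
    return [[math.exp((phi1[i] + phi2[j] - cost[i][j]) / eps) * mu1[i] * mu2[j]
             for j in range(len(mu2))] for i in range(len(mu1))]

# Toy data (hypothetical): three atoms against two atoms on the line, quadratic cost.
x1, x2 = [0.0, 0.5, 1.0], [0.2, 0.8]
mu1, mu2 = [0.2, 0.3, 0.5], [0.6, 0.4]
cost = [[(a - b) ** 2 for b in x2] for a in x1]
phi1, phi2 = sinkhorn(cost, mu1, mu2, eps=0.5)
gamma = entropic_plan(cost, mu1, mu2, phi1, phi2, eps=0.5)
row_sums = [sum(row) for row in gamma]                                   # should match mu1
col_sums = [sum(gamma[i][j] for i in range(3)) for j in range(2)]        # should match mu2
```

After the last update of `phi2` the second marginal constraint holds up to rounding by construction; the first one holds only in the limit, which is why the row sums are the natural convergence check.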

Slide 7

Slide 7 text

Taking $\varepsilon = 1$ to simplify, this rewrites as $T(\phi,\mu) = 0$ where $\phi = (\phi_1,\phi_2)$, $\mu = (\mu_1,\mu_2)$ and $T = (T_1,T_2)$, whose components are given by
$$x_1 \mapsto T_1(\phi,\mu)(x_1) := \phi_1(x_1) + \log \int_{X_2} e^{\phi_2(x_2)-c(x_1,x_2)}\,\mu_2(dx_2)$$
and
$$x_2 \mapsto T_2(\phi,\mu)(x_2) := \phi_2(x_2) + \log \int_{X_1} e^{\phi_1(x_1)-c(x_1,x_2)}\,\mu_1(dx_1).$$

Slide 8

Slide 8 text

Assume $c \in C^k(X_1 \times X_2)$ (i.e. $c$ has a $C^k$ extension over $\mathbb{R}^d \times \mathbb{R}^d$), quotient $C^k(X_1) \times C^k(X_2)$ by the equivalence relation $\phi \sim \psi$ if there is a constant $\lambda$ such that $\phi_1 = \psi_1 + \lambda$ and $\phi_2 = \psi_2 - \lambda$, and then view $T(\cdot,\mu)$ as a self-map of (the Banach space) $C^k := C^k(X_1) \times C^k(X_2)/\sim$.

Slide 9

Slide 9 text

Given $\mu$, the existence and uniqueness of $\phi \in C^k$ fitting the marginal constraints, i.e. such that $T(\phi,\mu) = 0$, was established (by variational or fixed-point arguments, using the Hilbert metric) by Borwein, Lewis and Nussbaum (1994). A more elementary proof, with an extension to the multi-marginal case, uses the Sinkhorn algorithm: Gerolin and Di Marino (2019). Local/global inversion arguments by C. and Laborde (2020) give smooth dependence for $L^\infty$ perturbations of the marginals. Quantitative convergence of Sinkhorn is still an active area (bounded costs, multi-marginals: C.; unbounded costs: Léger, Nutz, Eckstein...).

Slide 10

Slide 10 text

Denote then by $S(\mu) = (\phi_1,\phi_2) \in C^k$ this solution: $S$ is the Schrödinger map, which maps marginals to the Schrödinger potentials (with a suitable normalization or, equivalently, by quotienting by $\sim$).

Slide 11

Slide 11 text

Outline
➀ Displacement smoothness of EOT
➁ Gradient flows
➂ Entropic semi-geostrophic equations
➃ Remarks on continuity equations
➄ Convergence as $\varepsilon \to 0$

Slide 12

Slide 12 text

Displacement smoothness of EOT
Equip $P(X_1) \times P(X_2)$ with the 2-Wasserstein product distance:
$$W_2^2(\mu,\nu) := W_2^2(\mu_1,\nu_1) + W_2^2(\mu_2,\nu_2)$$
where, for each pair of single measures, $W_2^2(\mu_i,\nu_i) := \inf_{\gamma \in \Pi(\mu_i,\nu_i)} \int |x-y|^2\,d\gamma(x,y)$. Let $\gamma_i \in \Pi(\mu_i,\nu_i)$ (not necessarily optimal) for $i = 1, 2$, and for $t \in [0,1]$ define the displacement interpolation $\mu^t := (\mu_1^t,\mu_2^t)$ between $(\mu_1,\mu_2)$ and $(\nu_1,\nu_2)$ by
$$\int_{X_i} f\,d\mu_i^t := \int_{X_i \times X_i} f((1-t)x_i + t y_i)\,d\gamma_i(x_i,y_i)$$
for every $f \in C(X_i)$.
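For a finitely supported plan, the displacement interpolation defined above amounts to moving each atom along the segment from $x_i$ to $y_i$ while keeping its mass. A minimal sketch, assuming a plan stored as a list of `(x, y, w)` triples (this data layout and the function names are illustrative choices, not taken from the talk):

```python
def displacement_interpolation(plan, t):
    """Atoms of mu^t: push-forward of the plan by (x, y) -> (1 - t) x + t y.

    plan is a list of (x, y, w): x and y are coordinate tuples and w is the
    mass the coupling puts on the pair (x, y). The plan need not be optimal.
    """
    return [(tuple((1 - t) * a + t * b for a, b in zip(x, y)), w)
            for x, y, w in plan]

def transport_cost(plan):
    """Quadratic cost of the plan: integral of |x - y|^2 against it."""
    return sum(w * sum((a - b) ** 2 for a, b in zip(x, y)) for x, y, w in plan)

# Toy plan (hypothetical): two atoms of mass 1/2, each shifted by one unit along the first axis.
plan = [((0.0, 0.0), (1.0, 0.0), 0.5), ((0.0, 1.0), (1.0, 1.0), 0.5)]
midpoint = displacement_interpolation(plan, 0.5)
```

At `t = 0` the atoms sit at the `x` points and at `t = 1` at the `y` points, so the curve joins the two marginals of the plan.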

Slide 13

Slide 13 text

Letting $\phi^t := S(\mu^t)$ be the Schrödinger potentials between $\mu_1^t$ and $\mu_2^t$, we then have:
Theorem 1 (C., Chizat, Laborde, 2022). For $p, k \in \mathbb{N}^*$, $p \le k$, if $c \in C^{k+p}(X_1 \times X_2)$ then the parametrized Schrödinger map $t \mapsto \phi^t = S(\mu^t)$ belongs to $C^p([0,1]; C^k)$. Moreover, there exists $C > 0$ that only depends on $\|c\|_{C^{k+1}}$ such that
$$\|\phi^t - \phi^s\|_{C^k} \le C\,|t-s|\,\sqrt{\mathrm{cost}(\gamma_1,\gamma_2)}$$
where
$$\mathrm{cost}(\gamma_1,\gamma_2) := \sum_{i=1}^{2} \int |x_i - y_i|^2\,d\gamma_i(x_i,y_i)$$
is the $L^2$ transport cost associated with the plans $\gamma_1, \gamma_2$.

Slide 14

Slide 14 text

Note the obvious corollary: if $c \in C^{k+1}$ then for some $C > 0$ one has
$$\|S(\mu) - S(\nu)\|_{C^k} \le C\,W_2(\mu,\nu).$$
We actually prove the previous theorem in the more general multi-marginal case (mainly at the expense of cumbersome notation). Sketch of proof of Theorem 1: set $G(t,\phi) := T(\phi,\mu^t)$; the proof is essentially the Implicit Function Theorem for $G$. First step: invertibility of $\partial_\phi G$.

Slide 15

Slide 15 text

Invertibility of $\partial_\phi G$, the derivative with respect to $\phi$: start with $(\psi_1,\psi_2)$ in its null space, i.e. solving the linearized system
$$0 = \psi_1(x_1) + \frac{\int_{X_2} \psi_2(x_2)\,e^{-c(x_1,x_2)+\phi_2(x_2)}\,d\mu_2^t(x_2)}{\int_{X_2} e^{-c(x_1,x_2)+\phi_2(x_2)}\,d\mu_2^t(x_2)},$$
$$0 = \psi_2(x_2) + \frac{\int_{X_1} \psi_1(x_1)\,e^{-c(x_1,x_2)+\phi_1(x_1)}\,d\mu_1^t(x_1)}{\int_{X_1} e^{-c(x_1,x_2)+\phi_1(x_1)}\,d\mu_1^t(x_1)}.$$
It is convenient to rewrite it in terms of conditional expectations with respect to the probability $Q := \alpha\,e^{-c(x_1,x_2)+\phi_2(x_2)+\phi_1(x_1)}\,\mu_1^t \otimes \mu_2^t$.

Slide 16

Slide 16 text

The linearized system becomes
$$\psi_1(x_1) + \int_{X_2} \psi_2(x_2)\,dQ_2(x_2|x_1) = 0 \quad \text{and} \quad \psi_2(x_2) + \int_{X_1} \psi_1(x_1)\,dQ_1(x_1|x_2) = 0.$$
Multiplying the first equation by $\psi_1(x_1)$ and integrating with respect to $Q_1$ we get
$$\int_{X_1} \psi_1^2(x_1)\,dQ_1(x_1) + \int_{X_1 \times X_2} \psi_1(x_1)\psi_2(x_2)\,dQ(x_1,x_2) = 0$$
and in a similar fashion
$$\int_{X_2} \psi_2^2(x_2)\,dQ_2(x_2) + \int_{X_1 \times X_2} \psi_1(x_1)\psi_2(x_2)\,dQ(x_1,x_2) = 0.$$

Slide 17

Slide 17 text

Adding the two identities, we get
$$\int_{X_1 \times X_2} (\psi_1(x_1) + \psi_2(x_2))^2\,dQ(x_1,x_2) = 0$$
which in the end implies $(\psi_1,\psi_2) \sim 0$. This shows that $\partial_\phi G$ is one-to-one; invertibility then follows from the Fredholm alternative. One has to work a bit to bound the operator norm (in $C^k$) of its inverse at $\phi^t$, and to bound derivatives of $G$ with respect to $t$ in terms of $\mathrm{cost}(\gamma)$.

Slide 18

Slide 18 text

Application to gradient flows
Given $V$ a functional of probability measures, since the seminal work of Jordan, Kinderlehrer and Otto, the evolution equation
$$\partial_t \rho = \mathrm{div}\Big(\rho \nabla \frac{\delta V}{\delta \rho}(\rho)\Big), \quad \rho(0,\cdot) = \rho_0$$
(with no-flux boundary conditions) can be seen as the gradient flow of $V$ for the 2-Wasserstein metric. In Ambrosio-Gigli-Savaré's green book, the key to well-posedness is the displacement semi-convexity of $V$:
$$V(\mu_t) \le (1-t)V(\mu) + tV(\nu) + \frac{\lambda}{2}\,t(1-t)\,W_2^2(\mu,\nu)$$
where $\mu_t$, $t \in [0,1]$, is the displacement interpolation (via an optimal plan) between $\mu$ and $\nu$.

Slide 19

Slide 19 text

EOT cost:
$$E(\mu) = E(\mu_1,\mu_2) := \inf_{\gamma \in \Pi(\mu_1,\mu_2)} \int c\,d\gamma + H(\gamma|\mu_1 \otimes \mu_2).$$
If $c$ is $C^2$ and $\mu^t$ is a (not necessarily optimal) displacement interpolation between $\mu^0$ and $\mu^1$ using the plans $\gamma = (\gamma_1,\gamma_2)$, we deduce from Theorem 1 that $E$ is "displacement $C^{1,1}$", i.e. semi-convex and semi-concave:
$$(1-t)E(\mu^0) + tE(\mu^1) - C\,\mathrm{cost}(\gamma)\,\frac{t(1-t)}{2} \le E(\mu^t) \le (1-t)E(\mu^0) + tE(\mu^1) + C\,\mathrm{cost}(\gamma)\,\frac{t(1-t)}{2}$$
for some $C > 0$ depending on $\|c\|_{C^2}$, and the Schrödinger map $S(\mu)$ is the gradient of $E$.

Slide 20

Slide 20 text

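Examples:
• well-posedness of the gradient flow of $\rho \mapsto E(\rho,\mu)$ (with fixed $\mu$): $\partial_t \rho = \mathrm{div}(\rho \nabla S_1(\rho,\mu))$.
• (possibly nonlinear) diffusive version, $m > 1$, $\alpha \ge 0$: $\partial_t \rho = \mathrm{div}(\rho \nabla S_1(\rho,\mu)) + \alpha \Delta \rho^m$.
• Sinkhorn divergence $\rho \mapsto E(\rho,\mu) - \frac{1}{2}E(\rho,\rho) - \frac{1}{2}E(\mu,\mu)$ ($\mu$ fixed): $\partial_t \rho = \mathrm{div}\big(\rho\big(\nabla S_1(\rho,\mu) - \frac{1}{2}(\nabla S_1(\rho,\rho) + \nabla S_2(\rho,\rho))\big)\big) + \alpha \Delta \rho^m$.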

Slide 21

Slide 21 text

Works for systems such as
$$\partial_t \rho_i = \mathrm{div}(\rho_i \nabla S_i(\rho_1,\dots,\rho_N)) + \alpha_i \Delta \rho_i^{m_i}.$$
Particular case:
$$\partial_t \rho_i = \mathrm{div}(\rho_i \nabla S_i(\rho_1,\dots,\rho_N)) + \Delta \rho_i.$$
Exponential (log-Sobolev) convergence to the marginals of $\gamma^* := e^{-c}$: with $\gamma^t$ the optimal entropic plan between the marginals $\rho_i^t$, one has $H(\gamma^t|\gamma^*) \le Ce^{-\kappa t}$, so by Talagrand's transport inequality $W_2^2(\gamma^t,\gamma^*) \le Ce^{-\kappa t}$.

Slide 22

Slide 22 text

Entropic semi-geostrophic equations
Work in progress with H. Malamut. The semi-geostrophic equations are a simple model used in meteorology to describe large-scale atmospheric flows. The model goes back to the work of Eliassen in 1948 and Hoskins in 1975, and was revived from the 1980s on with the works of Mike Cullen. There has been a lot of interest in the last 25 years due to connections with OT and Monge-Ampère equations: Benamou and Brenier 1998, Gangbo and Cullen 2001, Loeper 2005, Ambrosio, Colombo, De Philippis and Figalli 2014.

Slide 23

Slide 23 text

Given $\Omega$ a Lipschitz bounded open subset of $\mathbb{R}^3$, and $\alpha_0$ a Borel measure on $\mathbb{R}^3$ with total mass $|\Omega|$, the semi-geostrophic system reads as the coupling of
$$\partial_t \alpha + \mathrm{div}(\alpha J(\mathrm{id} - \nabla\psi)) = 0, \quad \alpha(0,\cdot) = \alpha_0, \quad J := \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \tag{2}$$
with the Monge-Ampère equation
$$\det(D^2\psi_t) = \alpha_t, \quad \psi_t \text{ convex}, \tag{3}$$
which has to be understood in a suitable weak sense using optimal transport: by Brenier's theorem, $\nabla\psi_t$ is the quadratic OT map between $\alpha_t$ and the uniform measure $\mu_0$ on $\Omega$. Existence was shown by Benamou and Brenier.

Slide 24

Slide 24 text

Consider the slightly more general equation
$$\partial_t \alpha + \mathrm{div}(\alpha A(\mathrm{id} - \nabla\psi)) = 0, \quad \alpha(0,\cdot) = \alpha_0, \quad \det(D^2\psi_t)\,\mu_0(\nabla\psi_t) = \alpha_t, \tag{4}$$
with $\psi$ convex. For $d = 3$, $\mu_0$ the uniform probability measure on $\Omega$ and $A = J$, we recover the initial problem (2)-(3) after normalizing all measures by dividing them by $|\Omega|$. Idea: view $\nabla u = \mathrm{id} - \nabla\psi$ as the conditional expectation of $x - y$ given $x$ for an optimal transport plan $\gamma$.

Slide 25

Slide 25 text

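Weak solution: a curve $t \in [0,T] \mapsto \alpha_t$ such that for every $f \in C_c^1([0,T] \times \mathbb{R}^d)$ one has
$$\int_0^T \int_{\mathbb{R}^d} [\partial_t f + Ax \cdot \nabla f]\,\alpha_t(dx)\,dt - \int_0^T \int_{\mathbb{R}^d \times \mathbb{R}^d} Ay \cdot \nabla f(t,x)\,\gamma_t(dx,dy)\,dt = \int_{\mathbb{R}^d} f(T,x)\,\alpha_T(dx) - \int_{\mathbb{R}^d} f(0,x)\,\alpha_0(dx), \tag{5}$$
where $\gamma_t$ is an optimal plan between $\alpha_t$ and $\mu_0$, i.e. $\gamma_t \in \Pi(\alpha_t,\mu_0)$ and
$$W_2^2(\alpha_t,\mu_0) = \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\,\gamma_t(dx,dy) \quad \text{for a.e. } t \in [0,T]. \tag{6}$$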

Slide 26

Slide 26 text

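Given $\varepsilon > 0$, the idea is to replace optimal plans $\gamma$ for $W_2$ by optimal entropic plans $\gamma^\varepsilon$. For $\alpha$ and $\mu$ compactly supported, consider
$$\mathrm{OT}_\varepsilon(\alpha,\mu) := \inf_{\gamma \in \Pi(\alpha,\mu)} \frac{1}{2}\int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\,\gamma(dx,dy) + \varepsilon H(\gamma|\alpha \otimes \mu). \tag{7}$$
The unique optimal plan $\gamma^\varepsilon$ for $\mathrm{OT}_\varepsilon(\alpha,\mu)$ has the Gibbs form
$$\gamma^\varepsilon(dx,dy) = \exp\Big(-\frac{|x-y|^2}{2\varepsilon} + \frac{u^\varepsilon(y) + v^\varepsilon(x)}{\varepsilon}\Big)\,\alpha(dx)\,\mu(dy) \tag{8}$$
where the potentials $u^\varepsilon$ and $v^\varepsilon$ are such that $\gamma^\varepsilon \in \Pi(\alpha,\mu)$, i.e. satisfy the Schrödinger system.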

Slide 27

Slide 27 text

Slide 28

Slide 28 text

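Entropic regularization $\mathrm{SG}_\varepsilon$ of (4) with parameter $\varepsilon > 0$:
$$\partial_t \alpha^\varepsilon + \mathrm{div}(\alpha^\varepsilon A(\nabla v^\varepsilon)) = 0 \text{ in } [0,T] \times \mathbb{R}^d, \quad \alpha^\varepsilon(0,\cdot) = \alpha_0, \tag{12}$$
where
$$\nabla v_t^\varepsilon(x) = x - \int_{\mathbb{R}^d} y\,\gamma_t^\varepsilon(dy|x) \tag{13}$$
and $\gamma_t^\varepsilon$ is the solution of $\mathrm{OT}_\varepsilon(\alpha_t^\varepsilon,\mu_0)$, i.e. $\gamma_t^\varepsilon \in \Pi(\alpha_t^\varepsilon,\mu_0)$ and
$$\mathrm{OT}_\varepsilon(\alpha_t^\varepsilon,\mu_0) = \frac{1}{2}\int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\,\gamma_t^\varepsilon(dx,dy) + \varepsilon H(\gamma_t^\varepsilon|\alpha_t^\varepsilon \otimes \mu_0). \tag{14}$$
It follows from Theorem 1 that (the smooth map) $\nabla v^\varepsilon$ depends in a $W_2$-Lipschitz way on $\alpha^\varepsilon$.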
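For discrete measures, formula (13) is a barycentric projection: at each atom $x$ one subtracts the mean of the conditional law $\gamma(dy|x)$. A minimal sketch, assuming the plan is given as a matrix of masses `gamma[i][j]` on the pairs `(xs[i], ys[j])`; the function name and the toy plan (an independent coupling, not an entropic optimizer) are hypothetical:

```python
def grad_potential(xs, ys, gamma):
    """Return x_i - E[y | x_i] for each atom x_i of the first marginal.

    gamma[i][j] is the mass of the plan on (xs[i], ys[j]); the conditional law
    gamma(dy | x_i) is row i divided by its total mass.
    """
    result = []
    for x, row in zip(xs, gamma):
        mass = sum(row)
        barycenter = [sum(g * y[k] for g, y in zip(row, ys)) / mass
                      for k in range(len(x))]
        result.append([a - b for a, b in zip(x, barycenter)])
    return result

# Toy data (hypothetical): independent coupling of two uniform two-point measures.
xs = [(0.0, 0.0), (2.0, 0.0)]
ys = [(0.0, 0.0), (1.0, 1.0)]
gamma = [[0.25, 0.25], [0.25, 0.25]]
grad_v = grad_potential(xs, ys, gamma)
```

With an independent coupling every conditional law equals the second marginal, so the barycenter is its mean `(0.5, 0.5)` at both atoms.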

Slide 29

Slide 29 text

Remarks on continuity equations
Consider a map $B : \alpha \in P_c(\mathbb{R}^d) \mapsto B[\alpha] \in C(\mathbb{R}^d,\mathbb{R}^d)$. The ideal situation to solve/approximate
$$\partial_t \alpha + \mathrm{div}(\alpha B[\alpha]) = 0, \quad \alpha(0,\cdot) = \alpha_0 \tag{15}$$
...

Slide 30

Slide 30 text

... is when $B$ satisfies the following properties:
• (H1) There exists $C > 0$ such that
$$|B[\alpha](x)| \le C(1+|x|), \quad \forall (x,\alpha) \in \mathbb{R}^d \times P_c(\mathbb{R}^d), \tag{16}$$
• (H2) For every $R > 0$,
$$K_R := \sup\{\mathrm{Lip}(B[\alpha], B_R),\ \alpha \in P(B_R)\} < +\infty, \tag{17}$$
• (H3) For every $R > 0$,
$$M_R := \sup_{\mathrm{spt}(\alpha^i) \subset B_R,\ \alpha^1 \ne \alpha^2} \frac{\|B[\alpha^1] - B[\alpha^2]\|_{L^\infty(B_R)}}{W_2(\alpha^1,\alpha^2)} < +\infty. \tag{18}$$

Slide 31

Slide 31 text

Under these assumptions, given $\alpha_0 \in P_c(\mathbb{R}^d)$, solving/approximating (15) is quite straightforward: it is the standard Cauchy-Lipschitz framework. Indeed, rewrite (15) as the fixed-point problem $\alpha = \Phi_{\alpha_0}(\alpha)$ with
$$\Phi_{\alpha_0}(\alpha)_t := (X_t^\alpha)_\# \alpha_0, \quad t \in [0,T],$$
where $X_t^\alpha$ is the (globally well-defined) flow of $B[\alpha]$:
$$\frac{d}{dt} X_t^\alpha(x) = B[\alpha_t](X_t^\alpha(x)), \quad X_0^\alpha(x) = x, \quad (t,x) \in [0,T] \times \mathbb{R}^d. \tag{19}$$
For well-chosen $\lambda > 0$, $\Phi_{\alpha_0}$ is a contraction for the distance $\mathrm{dist}(\alpha^1,\alpha^2) = \sup_{t \in [0,T]} e^{-\lambda t} W_2(\alpha_t^1,\alpha_t^2)$. This gives existence, uniqueness, and Lipschitz dependence with respect to the initial condition.

Slide 32

Slide 32 text

Can we apply this to the drift $B^\varepsilon$, i.e. does it satisfy (H1)-(H2)-(H3)? The linear growth condition (H1) is obvious, with a constant independent of $\varepsilon$: it follows from (10) and the fact that $\mathrm{spt}\,\gamma^\varepsilon(\cdot|x)$ lies in $B_{R_0}$. The Lipschitz (in $x$) requirement (H2) follows from (11) with a constant $K \sim R_0^2\,\varepsilon^{-1}$. We deduce from the displacement smoothness result that $B^\varepsilon$ satisfies (H3) (but with a very bad constant $M \sim e^{-A\varepsilon^{-1}}$).

Slide 33

Slide 33 text

Back to $\mathrm{SG}_\varepsilon$:
$$\partial_t \alpha^\varepsilon + \mathrm{div}(\alpha^\varepsilon B^\varepsilon[\alpha^\varepsilon]) = 0, \quad \alpha^\varepsilon(0,\cdot) = \alpha_0,$$
with $B^\varepsilon[\alpha] = J(\nabla v^\varepsilon)$, $v^\varepsilon$ the Schrödinger potential between $\alpha$ and $\mu_0$. By our general considerations on nice continuity equations, we deduce:
Theorem 2. For $\varepsilon > 0$, (12)-(13)-(14) admits a unique solution $\alpha^\varepsilon$.
Note that we have not used the Hamiltonian structure of $\mathrm{SG}_\varepsilon$ ($\mathrm{SG}_\varepsilon$ enters the Hamiltonian framework of Ambrosio and Gangbo). Conservation of energy is easy to see: $\mathrm{OT}_\varepsilon(\alpha^\varepsilon,\mu_0)$ is constant in time (and the vertical marginal of $\alpha^\varepsilon$ is of course constant as well).

Slide 34

Slide 34 text

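Convergence as $\varepsilon \to 0$
It is not difficult to show that cluster points of solutions of $\mathrm{SG}_\varepsilon$ solve SG, but this is of little practical use. A time and space discretization by an explicit Euler scheme was proposed by Benamou, Cotter and Malamut. Consider a time step $\tau_\varepsilon > 0$ with $T = N_\varepsilon \tau_\varepsilon$ and a quantized approximation of the initial $\alpha_0 \in P(B_{R_0})$,
$$\alpha_0^\varepsilon := \frac{1}{M_\varepsilon} \sum_{i=1}^{M_\varepsilon} \delta_{x_i^\varepsilon}, \quad x_i^\varepsilon \in B_{R_0},$$
and assume that
$$\tau_\varepsilon + W_2(\alpha_0,\alpha_0^\varepsilon) \to 0 \quad \text{as } \varepsilon \to 0^+. \tag{20}$$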

Slide 35

Slide 35 text

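Define a piecewise constant curve of measures $t \in [0,T] \mapsto \alpha_t^\varepsilon$ by the explicit Euler scheme, i.e.
$$\alpha_t^\varepsilon = \alpha_k^\varepsilon, \quad t \in [k\tau_\varepsilon,(k+1)\tau_\varepsilon), \quad k = 0,\dots,N_\varepsilon - 1,$$
starting from the quantized initial measure $\alpha_0^\varepsilon$, with
$$\alpha_{k+1}^\varepsilon = (\mathrm{id} + \tau_\varepsilon B^\varepsilon[\alpha_k^\varepsilon])_\# \alpha_k^\varepsilon, \quad k = 0,\dots,N_\varepsilon - 1,$$
with $B^\varepsilon$ defined through $\mathrm{OT}_\varepsilon(\alpha,\mu_0)$ as before. One can also quantize $\mu_0$; the computation of $B^\varepsilon[\alpha_k^\varepsilon]$ is then done by Sinkhorn.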
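For an equal-weight quantized measure, the push-forward step of the explicit Euler scheme moves every atom by `tau` times the drift evaluated at that atom. The sketch below only illustrates this particle update and is not the scheme's actual implementation: the drift is passed in as a callable, and the demo uses a hypothetical stand-in, $x \mapsto J(x - m)$, which is what $J(\nabla v^\varepsilon)$ reduces to in the degenerate case $\mu_0 = \delta_m$ (the only coupling is then the product one). In the scheme of the talk the drift would instead come from a Sinkhorn solve between $\alpha_k^\varepsilon$ and $\mu_0$.

```python
# Matrix J of the semi-geostrophic system: rotation of the horizontal components.
J = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

def apply_J(v):
    """Matrix-vector product J v for a 3-vector v."""
    return [sum(J[r][c] * v[c] for c in range(3)) for r in range(3)]

def euler_step(points, drift, tau):
    """One step alpha_{k+1} = (id + tau * B[alpha_k]) # alpha_k for equal-weight atoms.

    drift(points) must return B[alpha_k] evaluated at every atom of alpha_k.
    """
    velocities = drift(points)
    return [[p + tau * v for p, v in zip(x, b)] for x, b in zip(points, velocities)]

def euler_scheme(points, drift, tau, n_steps):
    """Return the list of particle configurations alpha_0, ..., alpha_{n_steps}."""
    trajectory = [points]
    for _ in range(n_steps):
        points = euler_step(points, drift, tau)
        trajectory.append(points)
    return trajectory

def dirac_target_drift(points, m=(0.0, 0.0, 0.0)):
    """Stand-in drift (assumption): J(x - m), i.e. the case mu_0 = delta_m."""
    return [apply_J([a - b for a, b in zip(x, m)]) for x in points]

# Toy run (hypothetical data): two particles, ten steps of size 0.1.
start = [[1.0, 0.0, 0.3], [0.0, 2.0, -0.1]]
trajectory = euler_scheme(start, dirac_target_drift, tau=0.1, n_steps=10)
```

Because the last row of `J` is zero, the third coordinate of every particle is left untouched by the update, so the vertical marginal is preserved at the discrete level too.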

Slide 36

Slide 36 text

Observing that $W_2(\alpha_t^\varepsilon,\alpha_s^\varepsilon) \le \kappa(|t-s| + \tau_\varepsilon)$ for every $t, s$ in $[0,T]$ and a constant $\kappa$ independent of $\varepsilon$, and passing along a suitable vanishing sequence $\varepsilon_n \to 0$, we may assume that for some $\alpha = (\alpha_t)_{t \in [0,T]} \in C([0,T],(P(B_{R_T}),W_2))$ (with $R_T := 2R_0 e^T$) one has
$$\sup_{t \in [0,T)} W_2(\alpha_t^\varepsilon,\alpha_t) \to 0 \quad \text{as } \varepsilon \to 0^+. \tag{21}$$
Cluster points of the previous approximations are weak solutions of the initial semi-geostrophic equations:
Theorem 3. If $\alpha$ is obtained as a cluster point of the discretized entropic regularization $(\alpha_t^\varepsilon)_{t \in [0,T)}$, i.e. (21) holds, then $\alpha$ is a weak solution of (2)-(3).

Slide 37

Slide 37 text

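Let $f \in C^1([0,T] \times \mathbb{R}^3)$ and observe that
$$\int_0^T \int_{\mathbb{R}^3} \partial_t f\,\alpha_t^\varepsilon = \sum_{k=0}^{N_\varepsilon - 1} \int_{\mathbb{R}^3} \int_{k\tau_\varepsilon}^{(k+1)\tau_\varepsilon} \partial_t f\,\alpha_k^\varepsilon = \sum_{k=0}^{N_\varepsilon - 1} \int_{\mathbb{R}^3} \big(f((k+1)\tau_\varepsilon,\cdot) - f(k\tau_\varepsilon,\cdot)\big)\,\alpha_k^\varepsilon$$
$$= \sum_{k=1}^{N_\varepsilon - 1} \int_{\mathbb{R}^3} f(k\tau_\varepsilon,\cdot)\,(\alpha_{k-1}^\varepsilon - \alpha_k^\varepsilon) + \int_{\mathbb{R}^3} f(T,\cdot)\,\alpha_{N_\varepsilon - 1}^\varepsilon - \int_{\mathbb{R}^3} f(0,\cdot)\,\alpha_0^\varepsilon. \tag{22}$$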

Slide 38

Slide 38 text

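Setting $B_k^\varepsilon = B^\varepsilon[\alpha_k^\varepsilon]$, denoting by $\gamma_{k-1}^\varepsilon$ the solution of $\mathrm{OT}_\varepsilon(\alpha_{k-1}^\varepsilon,\mu_0)$, and using the fact that $\alpha_k^\varepsilon = (\mathrm{id} + \tau_\varepsilon B_{k-1}^\varepsilon)_\# \alpha_{k-1}^\varepsilon$, we can rewrite
$$\int_{\mathbb{R}^3} f(k\tau_\varepsilon,\cdot)\,(\alpha_{k-1}^\varepsilon - \alpha_k^\varepsilon) = \int_{\mathbb{R}^3} \big(f(k\tau_\varepsilon,\cdot) - f(k\tau_\varepsilon,\mathrm{id} + \tau_\varepsilon B_{k-1}^\varepsilon)\big)\,\alpha_{k-1}^\varepsilon$$
$$= -\tau_\varepsilon \int_{\mathbb{R}^3} \nabla f(k\tau_\varepsilon,x) \cdot B_{k-1}^\varepsilon(x)\,\alpha_{k-1}^\varepsilon(dx) + o(\tau_\varepsilon)$$
$$= \int_{(k-1)\tau_\varepsilon}^{k\tau_\varepsilon} \int_{\mathbb{R}^6} \nabla f(t,x) \cdot J(y-x)\,\gamma_{k-1}^\varepsilon(dx,dy)\,dt + o(\tau_\varepsilon).$$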

Slide 39

Slide 39 text

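Considering the piecewise constant curve of plans $t \mapsto \gamma_t^\varepsilon$ defined by $\gamma_t^\varepsilon = \gamma_k^\varepsilon$ for $t \in [k\tau_\varepsilon,(k+1)\tau_\varepsilon)$, and recalling that $N_\varepsilon \tau_\varepsilon = T$, we thus have
$$\int_0^T \int_{\mathbb{R}^3} \partial_t f\,\alpha_t^\varepsilon = \int_0^T \int_{\mathbb{R}^6} \nabla f(t,x) \cdot J(y-x)\,\gamma_t^\varepsilon(dx,dy)\,dt + \int_{\mathbb{R}^3} f(T,x)\,\alpha_T(dx) - \int_{\mathbb{R}^3} f(0,x)\,\alpha_0(dx) + o(1). \tag{23}$$
Assume (possibly after an extraction) that $\gamma_t^\varepsilon(dx,dy) \otimes dt$ converges weakly-$*$ as $\varepsilon \to 0^+$ to some measure of the form $\gamma_t(dx,dy) \otimes dt$.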

Slide 40

Slide 40 text

Obviously, in the limit,
$$\int_0^T \int_{\mathbb{R}^3} \partial_t f\,\alpha_t = \int_0^T \int_{\mathbb{R}^6} \nabla f(t,x) \cdot J(y-x)\,\gamma_t(dx,dy)\,dt + \int_{\mathbb{R}^3} f(T,x)\,\alpha_T(dx) - \int_{\mathbb{R}^3} f(0,x)\,\alpha_0(dx)$$
and $\gamma_t \in \Pi(\alpha_t,\mu_0)$ for a.e. $t$. So to show that $\alpha$ is a weak solution of SG it remains to check that $\gamma_t$ is an optimal plan.

Slide 41

Slide 41 text

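It is known that
$$\mathrm{OT}_\varepsilon(\alpha_t^\varepsilon,\mu_0) \le \frac{1}{2} W_2^2(\alpha_t^\varepsilon,\mu_0) + \kappa' \varepsilon |\log(\varepsilon)|$$
for some constant $\kappa'$ that does not depend on $\varepsilon > 0$ and $t \in [0,T]$. We thus have
$$\int_0^T \int_{\mathbb{R}^3 \times \mathbb{R}^3} \frac{1}{2}|x-y|^2\,\gamma_t(dx,dy)\,dt = \lim_\varepsilon \int_0^T \int_{\mathbb{R}^3 \times \mathbb{R}^3} \frac{1}{2}|x-y|^2\,\gamma_t^\varepsilon(dx,dy)\,dt \le \limsup_\varepsilon \int_0^T \mathrm{OT}_\varepsilon(\alpha_t^\varepsilon,\mu_0)\,dt$$
$$\le \limsup_\varepsilon \int_0^T \frac{1}{2} W_2^2(\alpha_t^\varepsilon,\mu_0)\,dt = \int_0^T \frac{1}{2} W_2^2(\alpha_t,\mu_0)\,dt$$
which shows optimality of $\gamma_t$ for a.e. $t$.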