Slide 1

Slide 1 text

Wasserstein Gradient Flows of MMD Functionals with Distance Kernel and Cauchy Problems on Quantile Functions.
Joint work with Richard Duong (TU Berlin), Robert Beinert (TU Berlin), Johannes Hertrich (UCL), and Gabriele Steidl (TU Berlin).
University of South Carolina, Columbia, 30.08.2024. Joint ACM and RTG data science seminar (Changhui Tan, Siming He, Wuchen Li).

Slide 2

Slide 2 text

Motivation - Discrepancy minimization
Problem: the target measure $\nu \in \mathcal{P}_2(\mathbb{R}^d)$ is unknown; only samples are available.
Goal: recover $\nu$.
Solution: minimize the kernel discrepancy $\mathcal{F}_\nu := \mathrm{MMD}_K(\cdot,\nu)^2 \colon \mathcal{P}_2(\mathbb{R}^d) \to [0,\infty)$ (which can be estimated from samples) by finding a curve of measures $(\gamma_t)_{t>0} \subset \mathcal{P}_2(\mathbb{R}^d)$ along which $\mathcal{F}_\nu$ decreases "the fastest". [Figure ©Francis Bach]
One way to construct $(\gamma_t)_{t>0}$: Wasserstein gradient flows.
Different kernels $K$ lead to very different behavior of $(\gamma_t)_{t>0}$.
Aim of this preprint: study the behavior of $(\gamma_t)_{t>0}$ for the irregular kernel $K(x,y) := -|x-y|$ and $d = 1$.

Slide 3

Slide 3 text

1. Optimal transport and Wasserstein gradient flows [Figure ©Petr Mokrov]
2. Maximal monotone inclusions on Hilbert spaces: $(I + \lambda \partial F)^{-1}$
3. Negative distance kernel, maximum mean discrepancy [Figure: halftoning, https://www.mia.uni-saarland.de/Research/IP_Halftoning.shtml]
4. Wasserstein gradient flow of the MMD with negative distance kernel
5. Invariant subsets & smoothing properties: $\gamma_0 = \delta_x$, $\gamma_t \sim U[a_t, b_t]$ for $t > 0$
6. Numerical results [Figures: explicit vs. implicit scheme; flow trajectories starting at $\mu_0$]

Slide 4

Slide 4 text

Probability measures on the real line - CDF and quantile function
The cumulative distribution functions (CDFs) of $\nu \in \mathcal{P}(\mathbb{R})$ are given by
$R_\nu^+ \colon \mathbb{R} \to [0,1]$, $R_\nu^+(x) := \nu((-\infty, x])$, and $R_\nu^- \colon \mathbb{R} \to [0,1]$, $R_\nu^-(x) := \nu((-\infty, x))$,
and its generalized inverse, the quantile function, by
$Q_\nu \colon (0,1) \to \mathbb{R}$, $Q_\nu(s) := \min\{x \in \mathbb{R} : R_\nu^+(x) \geq s\}$.
[Figure, left to right: $R_\nu^+$, $R_\nu^-$, $Q_\nu$ for $\nu = \sum_{k=1}^4 w_k \delta_{x_k}$, $W_j := \sum_{k=1}^j w_k$.]
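A minimal NumPy sketch (not part of the slides) of these three objects for a discrete measure $\nu = \sum_k w_k \delta_{x_k}$; the function names and interface are illustrative only.

```python
import numpy as np

# Sketch: CDFs and quantile function of a discrete measure with atoms x
# (sorted ascending) and weights w (summing to 1).
def cdfs_and_quantile(x, w):
    x, W = np.asarray(x, float), np.cumsum(w)      # W_j = w_1 + ... + w_j

    def R_plus(t):      # R+(t) = nu((-inf, t]), right-continuous
        k = np.searchsorted(x, t, side="right")    # number of atoms x_j <= t
        return W[k - 1] if k > 0 else 0.0

    def R_minus(t):     # R-(t) = nu((-inf, t)), left-continuous
        k = np.searchsorted(x, t, side="left")     # number of atoms x_j < t
        return W[k - 1] if k > 0 else 0.0

    def Q(s):           # Q(s) = min{ t : R+(t) >= s }, s in (0, 1)
        return x[np.searchsorted(W, s, side="left")]

    return R_plus, R_minus, Q

Rp, Rm, Q = cdfs_and_quantile([0.0, 1.0, 2.0], [0.5, 0.25, 0.25])
print(Rp(1.0), Rm(1.0), Q(0.6))                    # 0.75 0.5 1.0
```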

Slide 5

Slide 5 text

Wasserstein-2 space on the real line
Consider the subset of probability measures $\mathcal{P}(\mathbb{R})$,
$\mathcal{P}_2(\mathbb{R}) := \{\mu \in \mathcal{P}(\mathbb{R}) : \int_{\mathbb{R}} x^2 \, d\mu(x) < \infty\}$.
On $\mathcal{P}_2(\mathbb{R})$, the Wasserstein-2 metric is
$W_2(\mu,\nu)^2 = \min_{\pi \in \Gamma(\mu,\nu)} \int_{\mathbb{R} \times \mathbb{R}} (x - y)^2 \, d\pi(x,y)$, $\mu, \nu \in \mathcal{P}_2(\mathbb{R})$,
with $\Gamma(\mu,\nu) := \{\pi \in \mathcal{P}(\mathbb{R} \times \mathbb{R}) : (P_1)_\# \pi = \mu, \ (P_2)_\# \pi = \nu\}$ and the projections $P_i(x_1, x_2) := x_i$.
The push-forward $\#$ acts as $(f_\# \sigma)(A) := \sigma(f^{-1}(A))$ for all measurable sets $A \subseteq \mathbb{R}$.
The optimal plan $\hat\pi$ is unique (in 1D) and realizes the monotone rearrangement.
[Figures: push-forward ©Gabriel Peyré, dsweber2.github.io/Optimal-Transport-Information-Geometry/; monotone rearrangement, juliaoptimaltransport.github.io/OptimalTransport.jl/dev/examples/OneDimension/]
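Because the monotone rearrangement is optimal in 1D, $W_2$ between two equally weighted empirical measures reduces to sorting; a minimal sketch (not from the slides), reusing the slide 19 example $\mu_0 = \mathcal{N}(-5,1)$, $\nu = \mathcal{N}(5,1)$:

```python
import numpy as np

# Sketch: W2 between two empirical measures with n equally weighted samples.
# In 1D, W2(mu, nu)^2 = int_0^1 |Q_mu(s) - Q_nu(s)|^2 ds, and the optimal
# plan is the monotone rearrangement: sort both sample vectors.
def w2_empirical(xs, ys):
    xs, ys = np.sort(xs), np.sort(ys)   # quantile functions on a uniform grid
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
mu_samples = rng.normal(-5.0, 1.0, size=10_000)
nu_samples = rng.normal(5.0, 1.0, size=10_000)
print(w2_empirical(mu_samples, nu_samples))   # approx 10 = |(-5) - 5|
```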

Slide 6

Slide 6 text

Manifold-like structure of $\mathcal{P}_2(\mathbb{R})$ - Geodesics and geodesic convexity
The geodesic (= shortest constant-speed path) between $\mu$ and $\nu$ is
$\gamma_t := ((1 - t) P_1 + t P_2)_\# \hat\pi$, $t \in [0, 1]$. (1)
Vertical ($L^2$: $g_t := (1 - t) f_\mu + t f_\nu$) vs. horizontal ($W_2$) mass displacement. [Figure ©Anna Korba]
Definition ($W_2$-geodesic convexity). $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ is convex along geodesics if
$\mathcal{F}(\gamma_t) \leq (1 - t)\,\mathcal{F}(\mu) + t\,\mathcal{F}(\nu)$ for all $\mu, \nu \in \mathcal{P}_2(\mathbb{R})$, with $\gamma_t$ from (1).
[Figures: geodesic on a manifold, dx.doi.org/10.1049/iet-rsn.2020.0185; $W_2$ geodesic ©Lénaïc Chizat]
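In 1D the geodesic (1) has a simple description through quantile functions, $Q_{\gamma_t} = (1-t) Q_\mu + t Q_\nu$ (a standard fact, used implicitly on the next slide). A sketch for equally weighted empirical measures:

```python
import numpy as np

# Sketch of displacement interpolation (1) for two empirical measures with n
# equally weighted samples each: in 1D, Q_{gamma_t} = (1 - t) Q_mu + t Q_nu,
# and sorting realizes the quantile functions on the uniform grid.
def geodesic_samples(xs, ys, t):
    return (1.0 - t) * np.sort(xs) + t * np.sort(ys)
```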

Slide 7

Slide 7 text

Isometric embedding via quantile functions
Theorem (Isometric embedding of $(\mathcal{P}_2(\mathbb{R}), W_2)$). Let $\mathcal{C}(0,1) := \{Q_\mu : \mu \in \mathcal{P}_2(\mathbb{R})\} \subset L^2(0,1)$ be the set of quantile functions. The map $\mathcal{P}_2(\mathbb{R}) \to \mathcal{C}(0,1)$, $\mu \mapsto Q_\mu$, is an isometric isomorphism with inverse $Q \mapsto Q_\# \Lambda_{(0,1)}$, where $\Lambda_{(0,1)}$ is the Lebesgue measure on $(0,1)$.
[Figure: isometric embedding, dx.doi.org/10.1155/IJMMS.2005.2241]
KEY IDEA: instead of working with $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}) \to (-\infty, \infty]$, find $F \colon L^2(0,1) \to (-\infty, \infty]$ satisfying $F(Q_\mu) = \mathcal{F}(\mu)$.
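The inverse map is exactly inverse-transform sampling; a two-line sketch (the quantile function used here is a made-up example):

```python
import numpy as np

# Sketch: pushing Lebesgue measure on (0,1) through a quantile function Q,
# i.e. Q_# Lambda_(0,1), is inverse-transform sampling: Q(U), U ~ U(0,1).
rng = np.random.default_rng(0)
Q = lambda s: np.sqrt(s)            # hypothetical quantile function on (0,1)
samples = Q(rng.uniform(size=10))   # samples distributed as Q_# Lambda_(0,1)
```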

Slide 8

Slide 8 text

From Euclidean to metric gradient flows
Given: a Hilbert space $H$ and an energy functional $f \colon H \to (-\infty, \infty]$ (convex, bounded below).
Idea: to minimize $f$, find an absolutely continuous curve $t \mapsto x(t)$ such that $t \mapsto f(x(t))$ decreases "the fastest":
$\frac{d}{dt} x(t) = -\nabla f(x(t))$. [Figure ©Francis Bach]
If $f$ is not differentiable, there might be more than one descent direction.
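For smooth $f$ the flow can be integrated numerically; a toy sketch for the assumed example $f(x) = x^2/2$, whose exact flow is $x(t) = x_0 e^{-t}$:

```python
import numpy as np

# Toy sketch: explicit Euler steps x_{n+1} = x_n - tau * grad f(x_n) for the
# gradient flow dx/dt = -grad f(x) with f(x) = x^2 / 2, so grad f(x) = x.
tau, x = 0.01, 1.0
for _ in range(100):              # integrate up to t = 100 * tau = 1
    x -= tau * x
print(x, np.exp(-1.0))            # Euler iterate vs. exact flow x(1) = e^{-1}
```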

Slide 9

Slide 9 text

Subdifferentials of convex functions are monotone
Definition. The subdifferential of a convex function $F \colon H \to \mathbb{R}$ is
$\partial F(u) := \{v \in H : F(w) \geq F(u) + \langle v, w - u \rangle \ \forall w \in H\}$. [Figure ©Pontus Giselsson]
Theorem (Properties of the resolvent). Let $F \colon H \to \mathbb{R}$ be convex and lower semicontinuous, $F \not\equiv \infty$. Then $\partial F$ is maximal monotone; hence for all $\varepsilon > 0$, the resolvent $J_\varepsilon^{\partial F} := (I + \varepsilon \partial F)^{-1} \colon H \to H$ is single-valued.
[Figure: $\partial F$, $I + \lambda \partial F$ and $(I + \lambda \partial F)^{-1}$ for $F := |\cdot|$.]
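For the figure's example $F = |\cdot|$ on $H = \mathbb{R}$, the resolvent has a closed form, the soft-thresholding operator; a sketch:

```python
# Sketch: the resolvent (I + eps * dF)^{-1} of F = |.| on R. It solves
# x in u + eps * d|u|, whose unique solution is soft-thresholding.
def resolvent_abs(x, eps):
    if x > eps:
        return x - eps
    if x < -eps:
        return x + eps
    return 0.0      # for x in [-eps, eps] the subdifferential at 0 absorbs x
```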

Slide 10

Slide 10 text

Maximal monotone inclusions on Hilbert spaces
This theorem tells us that gradient flows in Hilbert spaces exist and are unique.
Theorem (Existence, regularity of strong solutions to Cauchy problems (Brezis '67)). Let $f \colon H \to \mathbb{R}$ be such that $\partial f \colon H \to 2^H$ is maximal monotone and $g_0 \in \mathrm{dom}(\partial f)$. Then there exists a unique $g \colon [0, \infty) \to H$ such that
• $g(0) = g_0$, $\frac{dg}{dt}(t) \in -\partial f(g(t))$ for almost all $t > 0$, and $g(t) \in \mathrm{dom}(\partial f)$ for all $t > 0$;
• $g$ is Lipschitz continuous on $[0, \infty)$;
• $g$ is given by the exponential formula $g(t) = \lim_{n \to \infty} \big(J_{t/n}^{\partial f}\big)^n (g_0)$, uniformly in $t$ on compact intervals.
For $\mathcal{P}_2(\mathbb{R})$ instead of $H$: what are the analogs of the tangent vector $\frac{d}{dt} x(t)$ and the subdifferential $\partial f$?
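The exponential formula can be tried out directly for $f = |\cdot|$, whose exact flow is $g(t) = \mathrm{sign}(g_0)\max(|g_0| - t, 0)$; a self-contained sketch:

```python
# Sketch of the exponential formula g(t) = lim_n (J_{t/n})^n (g_0) for
# f = |.| on R; the resolvent J_eps is soft-thresholding (previous slide).
def soft_threshold(x, eps):
    return max(abs(x) - eps, 0.0) * (1.0 if x >= 0 else -1.0)

def exponential_formula(g0, t, n):
    g = g0
    for _ in range(n):              # n resolvent (implicit Euler) steps
        g = soft_threshold(g, t / n)
    return g

print(exponential_formula(2.0, 1.5, 1000))   # 0.5, matching max(2 - 1.5, 0)
```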

Slide 11

Slide 11 text

Wasserstein gradient flows in one dimension
Definition (Fréchet subdifferential in Wasserstein space). The (reduced) Fréchet subdifferential of $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ at $\mu$ is
$\partial \mathcal{F}(\mu) := \big\{\xi \in L^2(\mathbb{R}; \mu) : \mathcal{F}(\nu) - \mathcal{F}(\mu) \geq \int_{\mathbb{R} \times \mathbb{R}} \xi(x_1)(x_2 - x_1) \, d\hat\pi(x_1, x_2) + o(W_2(\mu, \nu))\big\}$.
A curve $\gamma \colon (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ is absolutely continuous if there exists an $L^2$-Borel velocity field $(v_t \colon \mathbb{R} \to \mathbb{R})_{t>0}$ such that
$\partial_t \gamma_t + \nabla \cdot (v_t \gamma_t) = 0$ on $(0, \infty) \times \mathbb{R}$, weakly. (Continuity Eq.)
Definition (Wasserstein gradient flow). An absolutely continuous curve $\gamma \colon (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ with velocity field $(v_t \in T_{\gamma_t} \mathcal{P}_2(\mathbb{R}))_{t>0}$ is a Wasserstein gradient flow with respect to $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ if $v_t \in -\partial \mathcal{F}(\gamma_t)$ for a.e. $t > 0$.
[Figure: tangent vector, https://personal.math.ubc.ca/~CLP/CLP3/clp_3_mc/sec_curves.html]

Slide 12

Slide 12 text

Fundamental theorem of Wasserstein gradient flows
Theorem (Ambrosio, Gigli, Savaré (2005)). Let $\mathcal{F} \colon \mathcal{P}_2(\mathbb{R}) \to \mathbb{R}$ be bounded from below, lower semicontinuous, and geodesically convex.
Existence and uniqueness. There exists a unique Wasserstein gradient flow $\gamma \colon (0, \infty) \to \mathcal{P}_2(\mathbb{R})$ with respect to $\mathcal{F}$ starting at $\gamma(0+) = \mu_0 \in \mathcal{P}_2(\mathbb{R})$.
Approximation scheme. The piecewise constant curves constructed from the iterates
$\mu^{n+1} := \operatorname{argmin}_{\mu \in \mathcal{P}_2(\mathbb{R})} \mathcal{F}(\mu) + \frac{1}{2\tau} W_2^2(\mu^n, \mu)$, $\tau > 0$, (minimizing movement scheme)
i.e. $\gamma^\tau$ defined by $\gamma^\tau|_{(n\tau, (n+1)\tau]} := \mu^n$, $n \in \mathbb{N}$, converge locally uniformly to $\gamma$ as $\tau \downarrow 0$.
Convergence speed. If $\bar\mu$ is a minimizer of $\mathcal{F}$, then $\mathcal{F}(\gamma_t) - \mathcal{F}(\bar\mu) \leq \frac{1}{2t} W_2^2(\mu_0, \bar\mu)$. (Sublinear convergence rate)

Slide 13

Slide 13 text

Reformulation of WGF as Cauchy problem
We establish a correspondence between $L^2(0,1)$-gradient flows of $F$ and Wasserstein gradient flows of $\mathcal{F}$.
Theorem (Quantile function reformulation [DSBHS24]). Let $F \colon L^2(0,1) \to (-\infty, \infty]$ be convex and lsc, $F \not\equiv \infty$, with $F(Q_\mu) = \mathcal{F}(\mu)$ for all $\mu \in \mathcal{P}_2(\mathbb{R})$. Assume $J_\varepsilon^{\partial F}$ maps $\mathcal{C}(0,1)$ into itself for all $\varepsilon > 0$. Then, for every initial datum $g_0 \in \mathcal{C}(0,1) \cap \mathrm{dom}(\partial F)$,
$\partial_t g(t) + \partial F(g(t)) \ni 0$, $t \in (0, \infty)$, $g(0) = g_0$, (Cauchy Problem)
has a unique strong solution $g$. The curve $\gamma_t := (g(t))_\# \Lambda_{(0,1)}$ has quantile functions $Q_{\gamma_t} = g(t)$ and is a Wasserstein gradient flow of $\mathcal{F}$ with $\gamma(0+) = (g_0)_\# \Lambda_{(0,1)}$.

Slide 14

Slide 14 text

Maximum Mean Discrepancy
Consider the negative distance kernel $K(x, y) := -|x - y|$; $K$ is only conditionally positive definite.
Motivation: electrostatic principles, interacting species, dithering. [Figure: dithering, doi.org/10.1137/100790197]
Definition (MMD). The maximum mean discrepancy with respect to $K$ is
$\mathcal{P}_2(\mathbb{R}) \times \mathcal{P}_2(\mathbb{R}) \to [0, \infty)$, $(\mu, \nu) \mapsto \frac{1}{2} \mathrm{MMD}_K(\mu, \nu)^2$, where $\mathrm{MMD}_K(\mu, \nu)^2 = \int_{\mathbb{R} \times \mathbb{R}} K(x, y) \, d(\mu - \nu)(x) \, d(\mu - \nu)(y)$.
Up to an additive constant (independent of $\mu$), this equals
$\mathcal{F}_\nu(\mu) := \underbrace{-\tfrac{1}{2} \int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\mu(x) \, d\mu(y)}_{\text{interaction}} + \underbrace{\int_{\mathbb{R} \times \mathbb{R}} |x - y| \, d\mu(x) \, d\nu(y)}_{\text{potential}}$. (2)
$\mathrm{MMD}_K$ is a metric; $\mathrm{MMD}_K^2$ is also known as the energy distance.
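A plug-in estimator from samples (a sketch, not the authors' code), using the distance-kernel identity $\mathrm{MMD}_K^2 = 2\,\mathbb{E}|X - Y| - \mathbb{E}|X - X'| - \mathbb{E}|Y - Y'|$ for $X, X' \sim \mu$ and $Y, Y' \sim \nu$:

```python
import numpy as np

# Sketch: V-statistic estimator of MMD_K(mu, nu)^2 for K(x, y) = -|x - y|
# from equally weighted samples xs ~ mu, ys ~ nu.
def mmd_distance_kernel(xs, ys):
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    dxy = np.abs(xs[:, None] - ys[None, :]).mean()   # E|X - Y|
    dxx = np.abs(xs[:, None] - xs[None, :]).mean()   # E|X - X'|
    dyy = np.abs(ys[:, None] - ys[None, :]).mean()   # E|Y - Y'|
    return 2.0 * dxy - dxx - dyy
```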

Slide 15

Slide 15 text

Properties of $\mathcal{F}_\nu$
We now apply this theorem to the MMD with the negative distance kernel, i.e. to $\mathcal{F}_\nu := \frac{1}{2} \mathrm{MMD}_K(\cdot, \nu)^2$.
Lemma (Properties of $F_\nu$). The functional
$F_\nu \colon L^2(0,1) \to \mathbb{R}$, $u \mapsto \int_0^1 \Big[ (1 - 2s)\big(u(s) - Q_\nu(s)\big) + \int_0^1 |u(s) - Q_\nu(t)| \, dt \Big] ds$,
is convex and continuous, we have $F_\nu(Q_\mu) = \mathcal{F}_\nu(\mu)$ for all $\mu \in \mathcal{P}_2(\mathbb{R})$, and
$\partial F_\nu(u) = \big\{f \in L^2(0,1) : f(s) \in 2\big[R_\nu^-(u(s)), R_\nu^+(u(s))\big] - 2s \text{ for a.e. } s \in (0,1)\big\}$, $u \in L^2(0,1)$,
and $J_\varepsilon^{\partial F_\nu}$ maps $\mathcal{C}(0,1)$ into itself for all $\varepsilon > 0$.
Proof: by elementary means; nothing fancy happens here.
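A quick numerical sanity check (illustrative only; midpoint discretization of $(0,1)$, discrete target, made-up names):

```python
import numpy as np

# Sketch: evaluate F_nu(u) on a midpoint grid of (0,1) for a discrete target
# nu = sum_j w_j delta_{x_j}, discretizing
# F_nu(u) = int_0^1 [ (1-2s)(u(s)-Q_nu(s)) + int_0^1 |u(s)-Q_nu(t)| dt ] ds.
def F_nu(u, x, w, m=1000):
    s = (np.arange(m) + 0.5) / m
    Qnu = np.asarray(x, float)[np.searchsorted(np.cumsum(w), s)]
    us = u(s)
    inner = np.abs(us[:, None] - Qnu[None, :]).mean(axis=1)  # int |u(s)-Q_nu(t)| dt
    return np.mean((1.0 - 2.0 * s) * (us - Qnu) + inner)

# nu = (delta_0 + delta_1) / 2; the target's quantile function minimizes F_nu.
print(F_nu(lambda s: 0.0 * s, [0.0, 1.0], [0.5, 0.5]))                  # ~0.75
print(F_nu(lambda s: (s >= 0.5).astype(float), [0.0, 1.0], [0.5, 0.5])) # ~0.5
```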

Slide 16

Slide 16 text

Smoothing properties and invariant subsets
The lower Lipschitz constant of $g \in \mathcal{C}(0,1)$ is
$L_{\mathrm{low}}(g) := \max\big\{L \geq 0 : \frac{g(s_1) - g(s_2)}{s_1 - s_2} \geq L \ \forall s_1, s_2 \in (0,1)\big\}$.
If $\mu = \delta_x$, then $Q_\mu \equiv x$ and $L_{\mathrm{low}}(Q_\mu) = 0$.
Theorem (Time evolution of the CDF's lower Lipschitz constants). Let $\nu \in \mathcal{P}_2(\mathbb{R})$ with $L_{\mathrm{low}}(Q_\nu) > 0$, and $g_0 = Q_{\mu_0} \in \mathcal{C}(0,1)$. We have
$\mathrm{Lip}(R_{\gamma_t}) \leq \Big( L_{\mathrm{low}}(g_0)\, e^{-\frac{2t}{L_{\mathrm{low}}(Q_\nu)}} + L_{\mathrm{low}}(Q_\nu) \big(1 - e^{-\frac{2t}{L_{\mathrm{low}}(Q_\nu)}}\big) \Big)^{-1} < \infty$.
Theorem (Continuity is preserved & monotonicity of the support). Let $\nu \in \mathcal{P}_2(\mathbb{R})$, let $g_0$ be continuous, and let $g$ be the solution of the Cauchy problem starting at $g_0$.
• $g(t)$ is continuous for all $t \geq 0$.
• The ranges fulfill $g(t_1)\big((0,1)\big) \subseteq g(t_2)\big((0,1)\big)$ for all $0 \leq t_1 \leq t_2$.

Slide 17

Slide 17 text

Explicit formula for discrete ν
Corollary (Point measure target). Let $\nu := \sum_{j=1}^n w_j \delta_{x_j}$. Then the $\mathcal{F}_\nu$ flow is given by
$[g(t)](s) := \begin{cases} Q_{\mu_0}(s) + 2(s - R_{s,0})\, t, & t \in [t_{s,0}, t_{s,1}), \\ x_{s,j} + 2(s - R_{s,j})(t - t_{s,j}), & t \in [t_{s,j}, t_{s,j+1}), \\ Q_\nu(s), & t \geq t_{s,|\ell_s - k_s|}, \end{cases}$
where $t_{s,0} := 0$, $t_{s,1} := \frac{x_{s,1} - Q_{\mu_0}(s)}{2(s - R_{s,0})}$, $t_{s,j+1} := t_{s,j} + \frac{x_{s,j+1} - x_{s,j}}{2(s - R_{s,j})}$, and the indices are determined by two cases:
Case $Q_{\mu_0}(s) \leq Q_\nu(s)$: $\ell_s$ with $W_{\ell_s - 1} < s < W_{\ell_s}$; $k_s$ with $x_{k_s} \leq Q_{\mu_0}(s) < x_{k_s + 1}$; $x_{s,j} := x_{k_s + j}$ for $j \leq |\ell_s - k_s|$; $R_{s,j} := W_{k_s + j}$ for $j \leq |\ell_s - k_s| - 1$.
Case $Q_{\mu_0}(s) \geq Q_\nu(s)$: $\ell_s$ with $W_{\ell_s - 1} < s < W_{\ell_s}$; $k_s$ with $x_{k_s - 1} < Q_{\mu_0}(s) \leq x_{k_s}$; $x_{s,j} := x_{k_s - j}$ for $j \leq |\ell_s - k_s|$; $R_{s,j} := W_{k_s - j - 1}$ for $j \leq |\ell_s - k_s| - 1$.
[Figure: flow trajectories for a discrete target, starting at $\mu_0$.]

Slide 18

Slide 18 text

Numerical experiments - Implicit Euler (backward) scheme
Let $\tau > 0$. The minimizing movement (or JKO) scheme,
$\mu^{n+1} := \operatorname{argmin}_{\mu \in \mathcal{P}_2(\mathbb{R})} \mathcal{F}_\nu(\mu) + \frac{1}{2\tau} W_2^2(\mu^n, \mu)$,
can be rewritten using the isometry $\mathcal{P}_2(\mathbb{R}) \to \mathcal{C}(0,1)$ as
$g^{n+1} = \operatorname{argmin}_{g \in \mathcal{C}(0,1)} F_\nu(g) + \frac{1}{2\tau} \int_0^1 |g - g^n|^2 \, ds \overset{F_\nu \in \Gamma_0(L^2(0,1))}{=} (I + \tau \partial F_\nu)^{-1}(g^n)$,
which is equivalent to
$g^n(s) + 2\tau s \in g^{n+1}(s) + 2\tau \big[R_\nu^-(g^{n+1}(s)), R_\nu^+(g^{n+1}(s))\big]$ (3)
for all $s \in (0,1)$; see the sketch below.
[Figure: one implicit Euler step visualized via the graphs of $g^n + 2\tau\,\mathrm{id}$ and $\mathrm{id} + 2\tau [R_\nu^-, R_\nu^+]$.]
$(g^n)_\# \Lambda_{(0,1)}$ approximates the WGF for $t \in (n\tau, (n+1)\tau]$. For fixed $\tau > 0$, $(g^n)_{n \in \mathbb{N}} \to Q_\nu$ weakly in $L^2(0,1)$ and $\mu^n := (g^n)_\# \Lambda_{(0,1)} \to \nu$ narrowly.
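For a discrete target the set-valued map $u \mapsto u + 2\tau[R_\nu^-(u), R_\nu^+(u)]$ in (3) is strictly monotone and piecewise affine, so each step can be inverted in closed form. A grid-based illustration (not the authors' code; names and parameters are made up):

```python
import numpy as np

# Sketch of one implicit Euler step (3) for nu = sum_j w_j delta_{x_j}.
# On (x_j, x_{j+1}) the map equals u + 2*tau*W_j; at the atom x_j it fills
# the "gap" [x_j + 2*tau*W_{j-1}, x_j + 2*tau*W_j]. Invert it per grid point.
def implicit_euler_step(g, s, x, w, tau):
    x, W = np.asarray(x, float), np.cumsum(w)
    v = g + 2.0 * tau * s                        # left-hand side of (3)
    lo = x + 2.0 * tau * np.concatenate(([0.0], W[:-1]))   # gap lower ends
    hi = x + 2.0 * tau * W                                 # gap upper ends
    out = np.empty_like(v)
    for i, vi in enumerate(v):
        j = np.searchsorted(hi, vi)              # first gap whose top >= vi
        if j < len(x) and vi >= lo[j]:
            out[i] = x[j]                        # vi lands in the gap: u = x_j
        else:                                    # vi on a linear piece
            out[i] = vi - 2.0 * tau * (W[j - 1] if j > 0 else 0.0)
    return out

# usage: iterate toward Q_nu for nu = (delta_0 + delta_1) / 2, mu_0 = delta_{-2}
m = 1000
s = (np.arange(m) + 0.5) / m
g = np.full(m, -2.0)                             # g_0 = Q_{delta_{-2}}
for _ in range(200):
    g = implicit_euler_step(g, s, [0.0, 1.0], [0.5, 0.5], tau=0.05)
```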

Slide 19

Slide 19 text

Numerical experiments - Explicit Euler (forward) scheme
If $R_\nu^+ = R_\nu^- =: R_\nu$, we can also use the explicit Euler discretization
$g^{n+1} = g^n - \tau \nabla F_\nu(g^n) = g^n - 2\tau (R_\nu \circ g^n - \mathrm{id})$.
Advantage: we do not have to solve an inclusion in each step.
Disadvantage: weaker convergence guarantees; it might not preserve $\mathcal{C}(0,1)$ (iterates not monotone).
[Figure: $\mu_0 = \mathcal{N}(-5, 1)$, $\nu = \mathcal{N}(5, 1)$; initial and target densities with explicit and implicit iterates.]
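A sketch of this scheme for the slide's example $\mu_0 = \mathcal{N}(-5,1)$, $\nu = \mathcal{N}(5,1)$, where $R_\nu$ is the continuous Gaussian CDF (grid size, step size, and iteration count are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Sketch: explicit Euler g_{n+1}(s) = g_n(s) - 2*tau*(R_nu(g_n(s)) - s),
# discretizing the quantile function on a uniform midpoint grid of (0,1).
m, tau = 1000, 0.05
s = (np.arange(m) + 0.5) / m
g = norm.ppf(s, loc=-5.0, scale=1.0)          # g_0 = Q_{mu_0}
for _ in range(500):
    g = g - 2.0 * tau * (norm.cdf(g, loc=5.0, scale=1.0) - s)
# g now approximates Q_nu = norm.ppf(s, loc=5.0, scale=1.0)
```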

Slide 20

Slide 20 text

Further work
• Extension to higher dimensions; the Lagrangian reformulation then involves the flow map.
• Deterministic particle approximations and mean-field limits.
• Convergence of the explicit scheme; distance between the implicit and explicit iterates.
• Generalize the Lipschitz properties to Hölder properties.
• Apply similar techniques to non-convex functionals, and then regularize.

Slide 21

Slide 21 text

Conclusion
• Reformulation as a maximal monotone inclusion Cauchy problem in $L^2(0,1)$ via quantile functions.
• Comprehensive description of the solutions' behavior; qualitative description of the instantaneous measure-to-$L^\infty$ regularization.
• Implicit Euler is simple.

Slide 22

Slide 22 text

Thank you for your attention! I am happy to take any questions.
Paper link: https://arxiv.org/abs/2408.07498
My website: viktorajstein.github.io
[AGS08, Bre73, CL71, JKO98]

Slide 23

Slide 23 text

References
[AGS08] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows, 2nd ed., Lectures in Mathematics ETH Zürich, Birkhäuser, Basel, 2008.
[Bre73] H. Brezis, Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert, North-Holland Mathematics Studies, North-Holland, Amsterdam, 1973 (French).
[CL71] M. G. Crandall and T. M. Liggett, Generation of semi-groups of nonlinear transformations on general Banach spaces, American Journal of Mathematics 93 (1971), no. 2, 265–298.
[JKO98] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker–Planck equation, SIAM Journal on Mathematical Analysis 29 (1998), no. 1, 1–17.

Slide 24

Slide 24 text

Shameless plug: other works
Interpolating between OT and KL-regularized OT using Rényi divergences.
The Rényi divergence ($\notin \{f\text{-divergences}, \text{Bregman divergences}\}$), for $\alpha \in (0,1)$:
$R_\alpha(\mu \mid \nu) := \frac{1}{\alpha - 1} \ln \int_X \Big(\frac{d\mu}{d\tau}\Big)^\alpha \Big(\frac{d\nu}{d\tau}\Big)^{1 - \alpha} d\tau$.
$\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) := \min_{\pi \in \Pi(\mu,\nu)} \langle c, \pi \rangle + \varepsilon R_\alpha(\pi \mid \mu \otimes \nu)$ is a metric, where $\varepsilon > 0$, $\mu, \nu \in \mathcal{P}(X)$, $X$ compact.
$\mathrm{OT}(\mu, \nu) \xleftarrow{\alpha \searrow 0 \text{ or } \varepsilon \to 0} \mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) \xrightarrow{\alpha \nearrow 1} \mathrm{OT}_\varepsilon^{\mathrm{KL}}(\mu, \nu)$.
In the works: the debiased Rényi–Sinkhorn divergence $\mathrm{OT}_{\varepsilon,\alpha}(\mu, \nu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\mu, \mu) - \frac{1}{2} \mathrm{OT}_{\varepsilon,\alpha}(\nu, \nu)$.